Blockchain security firm OpenZeppelin has flagged methodological flaws and data contamination in OpenAI’s EVMbench, a new AI benchmark for smart contract security created with crypto investment firm Paradigm. EVMbench, launched in mid-February, is designed to test how well AI agents can identify, patch, and exploit smart contract vulnerabilities.
OpenZeppelin said it welcomed the benchmark but applied the same scrutiny it uses for projects it helps secure — including protocols like Aave, Lido and Uniswap — and found two main problems: training-data contamination and incorrect vulnerability classifications, including several findings labeled high severity that are not exploitable in practice.
In its review, OpenZeppelin noted the benchmark’s dataset was drawn from 120 audits conducted between 2024 and mid-2025. Because many top-performing models have knowledge cutoffs around mid-2025, OpenZeppelin said those agents likely saw the same vulnerability reports during pretraining. Although EVMbench disconnected models from the internet during testing to prevent live searching, prior exposure would mean models might already “know” the answers, reducing the test’s ability to measure discovery of novel vulnerabilities. OpenZeppelin also raised concerns about the dataset’s limited size, which narrows the evaluation surface and amplifies the impact of any contamination.
OpenZeppelin additionally reported factual errors in the dataset. It identified at least four vulnerabilities that EVMbench classified as high severity but which OpenZeppelin says are not actually exploitable. EVMbench nonetheless credited AI agents for identifying those invalid vulnerabilities. OpenZeppelin emphasized these were not subjective disagreements over severity but cases where the described exploits simply do not work.
EVMbench’s initial leaderboard placed Anthropic’s Claude Opus 4.6 first, followed by OpenAI’s OC-GPT-5.2 and Google’s Gemini 3 Pro. OpenZeppelin stressed that while AI will significantly affect smart contract security, the tools and the benchmarks used to build and evaluate them must be rigorously designed and validated. “The question isn’t whether AI will transform smart contract security — it will. The question is whether the data and benchmarks we use to build and evaluate these tools are held to the same standard as the contracts they’re meant to protect,” the firm wrote.
OpenZeppelin’s findings suggest EVMbench may need dataset revisions and methodological updates to ensure it truly measures an AI agent’s ability to find previously unseen vulnerabilities, rather than rewarding recall of pretraining content or detection of invalid issues.
