OpenZeppelin, a blockchain security firm, has audited EVMbench, an AI benchmark for smart contract security developed with crypto investor Paradigm, and criticized it for methodological flaws and data contamination. Launched in mid-February, EVMbench was built to evaluate how well AI agents can identify, patch, and exploit smart contract vulnerabilities. OpenZeppelin welcomed the effort but says the benchmark requires corrections to be a reliable measure of AI capability.
The firm identified two core problems. First, training data contamination: EVMbench’s dataset was compiled from 120 audits carried out between 2024 and mid-2025. Because many top models have pretraining cutoffs around that same period, those models may have seen the underlying vulnerability reports during training. Disconnecting models from the internet during testing prevents live lookups but does not eliminate prior exposure, meaning high scores could reflect memorization of pretraining content rather than discovery of new issues. OpenZeppelin also noted the dataset’s small size, which increases the relative impact of any contamination.
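The contamination concern comes down to a date comparison: any benchmark item published on or before a model's training cutoff may already have been seen in pretraining. A minimal sketch of that check follows, assuming each item carries a publication date. The `AuditItem` type, the item IDs, the dates, and the cutoff are all hypothetical placeholders, not values from EVMbench or from any model vendor.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AuditItem:
    """Hypothetical benchmark item; not EVMbench's actual schema."""
    item_id: str
    published: date  # date the underlying audit report became public


def possibly_contaminated(items, training_cutoff):
    """Return items published on or before the model's training cutoff."""
    return [it for it in items if it.published <= training_cutoff]


def contamination_rate(items, training_cutoff):
    """Fraction of items the model could have seen during pretraining."""
    if not items:
        return 0.0
    return len(possibly_contaminated(items, training_cutoff)) / len(items)


# Illustrative data only: four made-up items and a made-up cutoff.
items = [
    AuditItem("audit-a", date(2024, 3, 1)),
    AuditItem("audit-b", date(2025, 1, 15)),
    AuditItem("audit-c", date(2025, 6, 30)),
    AuditItem("audit-d", date(2025, 9, 1)),
]
cutoff = date(2025, 3, 31)

flagged = possibly_contaminated(items, cutoff)
print([it.item_id for it in flagged])
print(f"possible contamination: {contamination_rate(items, cutoff):.0%}")
```

A date check like this only bounds the exposure. It cannot show that a model actually memorized a report, and with a small dataset each flagged item shifts the rate noticeably.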
Second, OpenZeppelin found factual errors and misclassifications in the dataset. At least four items labeled as high-severity vulnerabilities are, according to OpenZeppelin, not actually exploitable. EVMbench’s scoring nonetheless treated detection of those issues as correct, inflating agent performance. OpenZeppelin stressed these were not subjective severity disagreements but cases where the described exploits do not work in practice.
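The effect of invalid ground-truth entries on a detection score can be illustrated with a small calculation. The sketch below assumes a simple scoring rule (detected items divided by labeled items), which may differ from EVMbench's actual scoring. The item IDs and the agent's detections are invented for illustration; only the count of four invalid items mirrors OpenZeppelin's minimum figure.

```python
def detection_score(detected, ground_truth):
    """Fraction of ground-truth items the agent reported (assumed scoring rule)."""
    if not ground_truth:
        return 0.0
    return len(detected & ground_truth) / len(ground_truth)


# Hypothetical dataset of ten labeled vulnerabilities, V1..V10.
labeled = {f"V{i}" for i in range(1, 11)}

# Four labeled items assumed to be non-exploitable in practice.
invalid = {"V7", "V8", "V9", "V10"}

# Hypothetical agent output: two real findings plus three invalid ones.
detected = {"V1", "V2", "V7", "V8", "V9"}

raw = detection_score(detected, labeled)                  # counts invalid items as hits
corrected = detection_score(detected, labeled - invalid)  # scores valid items only

print(f"raw score:       {raw:.2f}")
print(f"corrected score: {corrected:.2f}")
```

In this made-up case the raw score credits the agent for reporting issues that do not work as described, so it overstates performance relative to the score computed on valid items only.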
EVMbench’s initial leaderboard placed Anthropic’s Claude Opus 4.6 first, OpenAI’s OC-GPT-5.2 second, and Google’s Gemini 3 Pro third. OpenZeppelin’s review emphasizes that while AI will change smart contract security, benchmarks and their data must be held to standards as rigorous as those applied to the contracts they aim to protect. The firm recommends dataset revisions and methodological updates so that EVMbench measures the genuine capability to find previously unseen vulnerabilities, rather than rewarding recall of pretraining content or invalid findings.