Rongchai Wang
Mar 05, 2026 00:55
New benchmark evaluates AI agents' ability to detect, patch, and exploit smart contract vulnerabilities. GPT-5.3-Codex scores 72.2% on exploit tasks.
OpenAI and crypto venture firm Paradigm have released EVMbench, a benchmark that measures how well AI agents can find, fix, and exploit vulnerabilities in Ethereum smart contracts. The announcement comes as AI-powered security tools race to protect the $100 billion-plus locked in DeFi protocols.
The benchmark draws on 120 curated high-severity vulnerabilities pulled from 40 real security audits, mostly from Code4rena competitions. It also includes vulnerability scenarios from security reviews of Tempo, a Layer 1 blockchain built for stablecoin payments.
Three Ways to Break Smart Contracts
EVMbench tests AI agents across three distinct modes. In Detect mode, agents audit contract repositories and are scored on finding known vulnerabilities. Patch mode requires agents to fix vulnerable code without breaking existing functionality. Exploit mode is the most aggressive: agents must execute actual fund-draining attacks against contracts deployed on a sandboxed blockchain.
The results show how quickly AI capabilities are advancing in this domain. GPT-5.3-Codex running via Codex CLI hit a 72.2% success rate on exploit tasks. That is more than double the 31.9% score from GPT-5, which launched just six months prior.
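The "more than double" comparison can be checked with quick arithmetic. The sketch below uses only the two exploit-mode figures reported above; the variable names are illustrative and are not taken from EVMbench itself.

```python
# Exploit-mode success rates as reported in this article (percent).
gpt_5_3_codex = 72.2
gpt_5 = 31.9

# Relative improvement: how many times higher the newer score is.
ratio = gpt_5_3_codex / gpt_5

# Absolute improvement, in percentage points.
point_gain = gpt_5_3_codex - gpt_5

print(f"ratio: {ratio:.2f}x")                       # ratio: 2.26x
print(f"gain: {point_gain:.1f} percentage points")  # gain: 40.3 percentage points
```

In other words, the newer model's reported score is roughly 2.26 times the older one, a gain of about 40 percentage points.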
Interestingly, AI agents perform better at attacking than defending. The exploit setting has a clear objective: keep iterating until the funds are drained. Detection and patching proved harder. Agents often stopped after finding one bug instead of auditing exhaustively, and maintaining full contract functionality while removing subtle vulnerabilities remained challenging.
Real Limitations Worth Noting
OpenAI acknowledged that EVMbench doesn't capture the full difficulty of real-world contract security. Heavily deployed protocols like Uniswap or Aave undergo far more scrutiny than audit competition code. The benchmark also can't verify whether an agent has found legitimate vulnerabilities that human auditors missed; it only checks against known issues.
The exploit environment runs on a clean local Anvil instance rather than forked mainnet state, and timing-dependent attacks fall outside its scope. For now, only single-chain environments are covered.
$10M for Defensive Research
Alongside EVMbench, OpenAI committed $10 million in API credits specifically for defensive security research. The company is expanding its Aardvark security research agent to more users and partnering with open-source maintainers for free codebase scanning.
The timing matters. As AI agents get better at exploiting contracts, the window between vulnerability discovery and exploitation shrinks. Protocol teams that aren't using AI-assisted auditing will increasingly find themselves at a disadvantage against attackers who are.
OpenAI released EVMbench's tasks, tooling, and evaluation framework publicly. For DeFi developers and security researchers, it's both a measuring stick and a warning about where AI capabilities are headed.