In short
- EVMbench tests AI agents on 120 real-world Ethereum smart contract vulnerabilities.
- The tool evaluates detection, patching, and exploitation across three distinct modes.
- GPT-5.3-Codex achieved a 72.2% success rate in exploit mode testing.
ChatGPT maker OpenAI and crypto-focused investment firm Paradigm have launched EVMbench, a tool to help improve Ethereum Virtual Machine smart contract security.
EVMbench is designed to evaluate AI agents' ability to detect, patch, and exploit high-severity vulnerabilities in Ethereum Virtual Machine (EVM) smart contracts.
Smart contracts are the heart of the Ethereum network, holding the code that powers everything from decentralized finance protocols to token launches. The weekly number of smart contracts deployed on Ethereum reached an all-time high of 1.7 million in November 2025, with 669,500 deployed last week alone, according to Token Terminal.
EVMbench draws on 120 curated vulnerabilities from 40 audits, most sourced from open audit competitions such as Code4rena, according to an OpenAI blog post. It also includes scenarios from the security auditing process for Tempo, Stripe's purpose-built layer-1 blockchain focused on high-throughput, low-cost stablecoin payments.
Payments giant Stripe launched the public testnet for Tempo in December, saying at the time that it was being built with input from Visa, Shopify, and OpenAI, among others.
The goal is to ground testing in economically meaningful, real-world code, particularly as AI-driven stablecoin payments expand, the firm added.
Introducing EVMbench, a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. https://t.co/op5zufgAGH
— OpenAI (@OpenAI) February 18, 2026
EVMbench is meant to evaluate AI models across three modes: detect, patch, and exploit. In "detect," agents audit repositories and are scored on their recall of ground-truth vulnerabilities. In "patch," agents must eliminate vulnerabilities without breaking intended functionality. Finally, in the "exploit" mode, agents attempt end-to-end fund-draining attacks in a sandboxed blockchain environment, with grading performed via deterministic transaction replay.
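To make the "detect" scoring concrete, here is a minimal Python sketch of recall: the share of known vulnerabilities that an agent's report actually covers. It is an illustration only, not OpenAI's grader. The function name `detect_recall` and the finding IDs are hypothetical, and the sketch assumes findings can be matched by exact ID, which a real audit grader would likely not be able to rely on.

```python
def detect_recall(reported, ground_truth):
    """Return the fraction of ground-truth vulnerabilities the agent reported.

    Both arguments are collections of finding IDs. Exact-ID matching is an
    assumption made for this sketch.
    """
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground_truth must not be empty")
    found = set(reported) & truth  # extra (false-positive) reports do not raise recall
    return len(found) / len(truth)


# Hypothetical audit with four known high-severity findings.
ground_truth = {"H-01", "H-02", "H-03", "H-04"}
# The agent reports two real findings and one that is not in the ground truth.
reported = {"H-01", "H-03", "X-99"}

print(f"recall = {detect_recall(reported, ground_truth):.2f}")  # prints "recall = 0.50"
```

Because recall only counts ground-truth findings that were caught, an agent that stops after its first discovery scores low even if that discovery is correct.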
In exploit mode, GPT-5.3-Codex running via OpenAI's Codex CLI achieved a score of 72.2%, compared to 31.9% for GPT-5, which was released six months earlier. Performance was weaker in the detect and patch tasks, where agents sometimes failed to audit exhaustively or struggled to preserve full contract functionality.
OpenAI's researchers cautioned that EVMbench does not fully capture real-world security complexity. Still, they added that measuring AI performance in economically relevant environments is important as models become powerful tools for both attackers and defenders.
Sam Altman's OpenAI and Ethereum co-founder Vitalik Buterin have previously been at odds over the pace of AI development.
In January 2025, Altman said that his firm was "confident we know how to build AGI as we have traditionally understood it." But Buterin advocated that AI systems should include a "soft pause" capability that could temporarily restrict industrial-scale AI operations if warning signs emerge.