OpenAI Abandons SWE-bench Verified After Discovering 59% of Failed Tests Were Flawed

OpenAI has stopped reporting scores on SWE-bench Verified, the widely used AI coding benchmark, after discovering that nearly 60% of the problems its models failed contained fundamentally broken tests. The company's February 23, 2026 analysis also found evidence that all major frontier models, including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash, had been trained on benchmark solutions, rendering scores meaningless.

“Improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities,” OpenAI stated. “Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.”

The Numbers Tell the Story

OpenAI audited 138 problems, 27.6% of the 500-problem dataset, that its o3 model could not consistently solve across 64 independent runs. The findings were damning: 59.4% of these problems had material issues in test design or problem descriptions that made them “extremely difficult or impossible even for the most capable model or human to solve.”

Breaking down the failures: 35.5% of audited tasks had overly strict tests that rejected functionally correct solutions by demanding specific implementation details never mentioned in problem descriptions. Another 18.8% tested for functionality that wasn’t even specified in the task.
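To put those percentages in absolute terms, the short calculation below converts them into approximate problem counts. It is a rough sketch: the counts are derived here by rounding the reported percentages and were not stated in the article, and the helper name `implied_count` is purely illustrative.

```python
TOTAL_PROBLEMS = 500   # size of SWE-bench Verified, as reported
AUDITED = 138          # problems o3 could not consistently solve


def implied_count(percent: float, base: int) -> int:
    """Nearest whole number of problems implied by a reported percentage."""
    return round(percent / 100 * base)


# Share of the full dataset that was audited.
audited_share = round(AUDITED / TOTAL_PROBLEMS * 100, 1)

# Counts implied by the reported percentages of the audited set
# (assumption: each percentage is a rounded share of the 138 problems).
flawed = implied_count(59.4, AUDITED)
strict = implied_count(35.5, AUDITED)
unspecified = implied_count(18.8, AUDITED)

# Flawed problems as a share of the whole 500-problem benchmark.
flawed_share_of_dataset = round(flawed / TOTAL_PROBLEMS * 100, 1)

print(f"audited: {audited_share}% of the dataset")
print(f"flawed: ~{flawed} problems ({flawed_share_of_dataset}% of all 500)")
print(f"overly strict tests: ~{strict}, unspecified functionality: ~{unspecified}")
```

Under these assumptions, the flawed problems found in the audit alone would account for roughly one in six problems in the full benchmark.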

One example involved a pylint PR where tests required importing a function called “get_annotation”, a name never mentioned in the problem statement. Models that solved the underlying issue correctly still failed because they didn’t psychically guess the expected function name.
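The failure mode in that example can be illustrated with a toy comparison. This is a hypothetical sketch, not the actual pylint code or test: only the name `get_annotation` comes from the article, while the two "patches", the `describe` entry point, and both test functions are invented for illustration.

```python
from types import SimpleNamespace


def _extract(node: dict):
    """Shared logic: pull the annotation out of a (toy) node."""
    return node.get("annotation")


def _describe(node: dict) -> str:
    """Public behavior that both patches expose identically."""
    return f"type: {_extract(node)}"


# Two hypothetical fixes with identical behavior but different helper names.
reference_patch = SimpleNamespace(get_annotation=_extract, describe=_describe)
alternative_patch = SimpleNamespace(annotation_of=_extract, describe=_describe)


def strict_test(patch) -> bool:
    """Passes only if the patch exposes a helper named exactly get_annotation."""
    helper = getattr(patch, "get_annotation", None)
    return helper is not None and helper({"annotation": "int"}) == "int"


def behavioral_test(patch) -> bool:
    """Passes if the observable behavior is correct, whatever the helper is called."""
    return patch.describe({"annotation": "int"}) == "type: int"


for name, patch in [("reference", reference_patch), ("alternative", alternative_patch)]:
    print(name, "strict:", strict_test(patch), "behavioral:", behavioral_test(patch))
```

The alternative patch behaves identically yet fails the strict test, which is the kind of rejection the audit attributes to tests that demand unstated implementation details.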

Every Major Model Is Contaminated

The contamination evidence proved more troubling. OpenAI built an automated red-teaming system using GPT-5 to probe competing models for benchmark knowledge. The results showed that all tested frontier models could reproduce original human-written solutions or quote verbatim problem details they should never have seen.

GPT-5.2, when given minimal hints, reproduced the exact code patch for a Django authentication fix, including the specific conditional statement “if username is None or password is None.” Claude Opus 4.5 quoted word-for-word an inline comment from a gold patch it supposedly never encountered. Gemini 3 Flash, given only a task ID, output the complete unified diff with correct line numbers.

The contamination creates an unfair advantage. Models that have seen solutions during training can pass underspecified tests by “remembering” implementation details that weren’t in the problem description, essentially having the answer key before the exam.
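One simple way to flag the kind of verbatim reproduction described above is to measure the longest run of characters a model's output shares with the gold patch. The sketch below is not OpenAI's red-teaming system, which the article describes only as GPT-5-driven probing; it is a minimal stand-in using Python's standard `difflib`. The sample strings are hypothetical apart from the quoted conditional, and the function name is illustrative.

```python
from difflib import SequenceMatcher


def longest_shared_run(model_output: str, gold_patch: str) -> str:
    """Return the longest contiguous substring present in both texts."""
    # autojunk=False disables difflib's heuristic that ignores very common characters.
    matcher = SequenceMatcher(None, model_output, gold_patch, autojunk=False)
    match = matcher.find_longest_match(0, len(model_output), 0, len(gold_patch))
    return model_output[match.a : match.a + match.size]


# Hypothetical inputs: only the conditional itself is taken from the article.
gold_patch = "if username is None or password is None:\n    return"
model_output = "The fix adds: if username is None or password is None: then exits early."

run = longest_shared_run(model_output, gold_patch)
print(repr(run), len(run))
```

A long shared run is only a signal, not proof: short idiomatic snippets can recur by chance, so any length threshold would need calibrating against patches the model could not have seen.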

From 80% to 23%

The benchmark’s decay became visible in stalled progress. State-of-the-art scores improved only from 74.9% to 80.9% over six months, not because models hit capability ceilings, but because the remaining problems were either impossible or required memorized knowledge.

SWE-bench Pro, the recommended replacement, paints a different picture. According to recent data from February 26, 2026, models scoring 80% on Verified dropped to roughly 23% on Pro, a benchmark designed to resist contamination. Claude Opus 4.6 currently leads Pro with 79.20% performance, though that figure measures a different, cleaner test set.

What Comes Next

OpenAI recommends the industry shift to SWE-bench Pro’s public split while acknowledging it is imperfect. The company is investing in privately authored benchmarks like GDPVal, where domain experts create original tasks and trained reviewers grade solutions holistically.

The broader lesson matters for anyone tracking AI capabilities: benchmarks sourced from public repositories carry inherent contamination risk. When training data includes the test, scores become theater. For researchers, investors, and developers betting on AI coding progress, the real frontier is harder to measure than leaderboards suggest.
