Close Menu
Cryprovideos
    What's Hot

    Dogecoin And Shiba Inu Could Be Gearing Up For One other Rally After This Occurred | Bitcoinist.com

    March 20, 2026

    7 Key Advantages That Set DeFi Aside from Banks – UseTheBitcoin

    March 20, 2026

    What Occurs to Bitcoin if Financial institution of America's 'Three Situations' for Fed Fee Hikes Hit? – Decrypt

    March 20, 2026
    Facebook X (Twitter) Instagram
    Cryprovideos
    • Home
    • Crypto News
    • Bitcoin
    • Altcoins
    • Markets
    Cryprovideos
    Home»Markets»OpenAI Says Benchmark Used to Measure AI Coding Ability Is 'Contaminated'—Right here's Why – Decrypt
    OpenAI Says Benchmark Used to Measure AI Coding Ability Is 'Contaminated'—Right here's Why – Decrypt
    Markets

    OpenAI Says Benchmark Used to Measure AI Coding Ability Is 'Contaminated'—Right here's Why – Decrypt

    By Crypto EditorFebruary 25, 2026No Comments4 Mins Read
    Share
    Facebook Twitter LinkedIn Pinterest Email


    In short

    • OpenAI argues that SWE-bench Verified not displays actual coding skill as a result of the benchmark is allegedly contaminated.
    • It’s now pushing SWE-bench Professional as more durable alternative.
    • Scores plunged from ~70% to ~23% on the newer benchmark,

    The quantity that each main AI lab has been utilizing to assert coding supremacy was simply declared meaningless.

    OpenAI printed a submit this week saying that SWE-bench Verified, the go-to benchmark for measuring AI coding capabilities, is so riddled with flawed assessments and coaching knowledge leakage that it not tells you something helpful about whether or not a mannequin can truly write software program.

    The benchmark works like this: Give an AI an actual GitHub challenge from a preferred open-source Python undertaking, ask it to repair the bug with out seeing the assessments, and test if its patch makes the failing assessments cross with out breaking anything.

    OpenAI created SWE-bench Verified in August 2024 as a cleaner model of the unique 2023 benchmark, recruiting 93 software program engineers to filter out duties that have been unimaginable or poorly designed.

    The cleanup labored nicely sufficient that each main lab began citing scores on it as proof of progress. When Anthropic launched Claude Opus 4 in Could 2025, Decrypt reported that the mannequin scored 72.5% on SWE-bench Verified, beating GPT-4.1’s 54.6% and Gemini 2.5 Professional’s 63.2%. It was the coding benchmark that mattered.

    Since then, each single AI lab from America to China has proven the SWE efficiency to assert the throne as the perfect mannequin for coding capabilities.

    Picture: Minimax

    Now OpenAI says that race was partly a mirage. Based on the report, the crew audited 138 duties that GPT-5.2 constantly failed throughout 64 impartial runs, and had six engineers evaluation every one. It in the end concluded that 59.4% of these duties are damaged.

    About 35.5% have assessments so narrowly written that they require a selected operate title by no means talked about in the issue description. One other 18.8% test for options that weren’t a part of the unique drawback in any respect, gathered from unrelated pull requests.

    The contamination drawback roughly works like this: SWE-bench pulls its issues from open-source repositories that the majority AI firms crawl when constructing coaching units. OpenAI examined whether or not GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview had seen the benchmark’s options throughout coaching. All three had.

    Given solely a job ID and a short trace, every mannequin might reproduce the precise code repair from reminiscence, together with variable names and inline feedback that seem nowhere in the issue description. In a single case, GPT-5.2’s chain-of-thought logs confirmed it reasoning {that a} particular parameter will need to have been “added round Django 4.1″—a element discovered solely in Django’s launch notes, not the duty description. It was answering a query it had already seen the reply to.

    OpenAI now recommends SWE-bench Professional, a more moderen benchmark from Scale AI that makes use of extra numerous codebases and licenses that cut back coaching knowledge publicity. The efficiency drop is jarring: fashions that cleared 70% on the outdated Verified benchmark rating round 23% on SWE-bench Professional’s public cut up, and even much less on its personal duties.

    On the present public SWE-bench Verified leaderboard, OpenAI is much from the benchmark’s podium. Retiring a benchmark the place you are shedding and endorsing one the place everybody begins at 23% resets the scoreboard at a handy second and makes the rivals’ claims much less spectacular.

    That is particularly essential contemplating that the a lot anticipated newer model of DeepSeek is rumored to beat or get extraordinarily near American ai fashions, particularly in agentic and coding duties with a free, open-source mannequin. That mannequin may very well be days away from launch, and SWE-bench Verified could be a key metric to measure its high quality.

    OpenAI mentioned it is constructing privately authored evaluations that will not be launched earlier than testing, pointing to its GDPVal undertaking the place area consultants write authentic duties graded by educated human reviewers.

    The benchmark drawback isn’t new, and isn’t distinctive to coding. AI labs have cycled via a number of evaluations, every helpful till fashions have been educated on them or till the duties proved too slim.

    However what makes this case notable is that OpenAI hyped SWE-bench Verified, promoted it throughout mannequin releases, and is now publicly documenting how completely it has failed—together with by exhibiting their very own mannequin dishonest on it.

    Day by day Debrief Publication

    Begin day by day with the highest information tales proper now, plus authentic options, a podcast, movies and extra.



    Supply hyperlink

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email

    Related Posts

    Dogecoin And Shiba Inu Could Be Gearing Up For One other Rally After This Occurred | Bitcoinist.com

    March 20, 2026

    Trump White Home Proposes Nationwide AI Framework, Urges Federal Customary

    March 20, 2026

    Wall Avenue is 'ring-fencing' the blockchain tech as Nasdaq’s tokenization plan wins a significant regulatory battle

    March 20, 2026

    AAVE Worth Prediction: Restoration to $125-$135 Vary by April 2026

    March 20, 2026
    Latest Posts

    What Occurs to Bitcoin if Financial institution of America's 'Three Situations' for Fed Fee Hikes Hit? – Decrypt

    March 20, 2026

    American bitcoin climbs to six,899 BTC in company treasury

    March 20, 2026

    Morgan Stanley Recordsdata Second Modification for Direct Spot Bitcoin ETF Product

    March 20, 2026

    Bitcoin Value Holds $70,000 As Battle-Pushed Inflation Concern Rises

    March 20, 2026

    Bitcoin Is Rising To The Quantum Problem, Galaxy Report Says

    March 20, 2026

    Morning Minute: Bitcoin Rebounds as Oil Falls – Decrypt

    March 20, 2026

    BitFuFu Cuts Self-Mined Bitcoin by 60% in 2025

    March 20, 2026

    Bitcoin worth evaluation: BTC faces new headwind from rising charge hike odds

    March 20, 2026

    CryptoVideos.net is your premier destination for all things cryptocurrency. Our platform provides the latest updates in crypto news, expert price analysis, and valuable insights from top crypto influencers to keep you informed and ahead in the fast-paced world of digital assets. Whether you’re an experienced trader, investor, or just starting in the crypto space, our comprehensive collection of videos and articles covers trending topics, market forecasts, blockchain technology, and more. We aim to simplify complex market movements and provide a trustworthy, user-friendly resource for anyone looking to deepen their understanding of the crypto industry. Stay tuned to CryptoVideos.net to make informed decisions and keep up with emerging trends in the world of cryptocurrency.

    Top Insights

    Holding Your Crypto Concepts Protected: What the SEC’s Newest Strikes Imply for Companies

    March 6, 2025

    3 Finest Meme Cash Set to Skyrocket as The Crypto Market Explodes

    April 28, 2025

    Theta Value Replace: THETA Strikes Close to $0.45 — Why EV2 Presale Is Capturing Investor Consideration Throughout Crypto and Gaming Communities

    November 13, 2025

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    • Home
    • Privacy Policy
    • Contact us
    © 2026 CryptoVideos. Designed by MAXBIT.

    Type above and press Enter to search. Press Esc to cancel.