    OpenAI Says Benchmark Used to Measure AI Coding Ability Is 'Contaminated': Here's Why – Decrypt

    By Crypto Editor | February 25, 2026


    In brief

    • OpenAI argues that SWE-bench Verified no longer reflects real coding ability because the benchmark is allegedly contaminated.
    • It’s now pushing SWE-bench Pro as a harder replacement.
    • Scores plunged from ~70% to ~23% on the newer benchmark.

    The number that every major AI lab has been using to claim coding supremacy was just declared meaningless.

    OpenAI published a post this week saying that SWE-bench Verified, the go-to benchmark for measuring AI coding capabilities, is so riddled with flawed tests and training data leakage that it no longer tells you anything useful about whether a model can actually write software.

    The benchmark works like this: Give an AI a real GitHub issue from a popular open-source Python project, ask it to fix the bug without seeing the tests, and check whether its patch makes the failing tests pass without breaking anything else.
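
As a rough illustration of that pass criterion, here is a minimal Python sketch. The function name, the argument names, and the test names are hypothetical and chosen for this example; it is not the benchmark's actual harness, only the rule described above: the tests that used to fail must now pass, and the tests that already passed must keep passing.

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Return True only if the patch fixes the bug without regressions.

    test_results maps a test name to True (passed) or False (failed)
    after the model's patch has been applied. A test missing from the
    results is treated as a failure.
    """
    # Every test that failed before the patch must pass now.
    bug_fixed = all(test_results.get(name, False) for name in fail_to_pass)
    # Every test that passed before the patch must still pass.
    nothing_broken = all(test_results.get(name, False) for name in pass_to_pass)
    return bug_fixed and nothing_broken


if __name__ == "__main__":
    # Hypothetical test names, for illustration only.
    after_patch = {"test_bug_is_fixed": True, "test_existing_behaviour": False}
    print(is_resolved(after_patch, ["test_bug_is_fixed"], ["test_existing_behaviour"]))
```

In this example the patch fixes the target test but breaks an existing one, so the task counts as unresolved.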

    OpenAI created SWE-bench Verified in August 2024 as a cleaner version of the original 2023 benchmark, recruiting 93 software engineers to filter out tasks that were impossible or poorly designed.

    The cleanup worked well enough that every major lab started citing scores on it as evidence of progress. When Anthropic released Claude Opus 4 in May 2025, Decrypt reported that the model scored 72.5% on SWE-bench Verified, beating GPT-4.1’s 54.6% and Gemini 2.5 Pro’s 63.2%. It was the coding benchmark that mattered.

    Since then, every AI lab from America to China has cited its SWE-bench performance to claim the throne as the best model for coding.

    Image: Minimax

    Now OpenAI says that race was partly a mirage. According to the report, the team audited 138 tasks that GPT-5.2 consistently failed across 64 independent runs, and had six engineers review each one. It ultimately concluded that 59.4% of those tasks are broken.

    About 35.5% have tests so narrowly written that they require a specific function name never mentioned in the problem description. Another 18.8% check for features that weren’t part of the original problem at all, pulled in from unrelated pull requests.
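
To put those percentages in terms of task counts, the short Python sketch below converts them back into whole tasks. It assumes all three percentages are shares of the same 138 audited tasks and that the two named categories do not overlap; neither assumption is spelled out in the figures quoted here, so treat the counts as implied rather than reported.

```python
AUDITED_TASKS = 138  # tasks reviewed in the audit, per the article

# Percentages quoted in the article, written as fractions.
shares = {
    "broken overall": 0.594,
    "tests too narrow": 0.355,
    "tests check unrelated features": 0.188,
}

# Implied whole-task counts (assumption: every share is out of all 138 tasks).
implied_counts = {label: round(share * AUDITED_TASKS) for label, share in shares.items()}

# Broken tasks not covered by the two named categories
# (assumption: the two categories do not overlap).
remainder = (
    implied_counts["broken overall"]
    - implied_counts["tests too narrow"]
    - implied_counts["tests check unrelated features"]
)

if __name__ == "__main__":
    for label, count in implied_counts.items():
        print(f"{label}: about {count} of {AUDITED_TASKS} tasks")
    print(f"other defects: about {remainder} tasks")
```

Under those assumptions the figures work out to roughly 82 broken tasks, of which about 49 have overly narrow tests and about 26 test unrelated features, leaving about 7 with other defects.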

    The contamination problem roughly works like this: SWE-bench pulls its problems from open-source repositories that most AI companies crawl when building training sets. OpenAI tested whether GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview had seen the benchmark’s solutions during training. All three had.

    Given only a task ID and a brief hint, each model could reproduce the exact code fix from memory, including variable names and inline comments that appear nowhere in the problem description. In one case, GPT-5.2’s chain-of-thought logs showed it reasoning that a specific parameter must have been “added around Django 4.1,” a detail found only in Django’s release notes, not the task description. It was answering a question it had already seen the answer to.
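
One simple way to look for that kind of memorization is to check for identifiers that a model's answer shares with the reference fix but that never appeared in the prompt. The Python sketch below illustrates the idea only; it is not OpenAI's methodology, and the function name and sample strings are hypothetical.

```python
import keyword
import re

# Word-like tokens of four or more characters; shorter ones are too common to be informative.
IDENTIFIER = re.compile(r"\b[A-Za-z_]\w{3,}\b")


def leaked_identifiers(prompt: str, reference_patch: str, model_output: str) -> set[str]:
    """Identifiers shared by the model output and the reference patch
    that do not occur anywhere in the prompt the model was given."""
    def identifiers(text: str) -> set[str]:
        # Drop Python keywords such as "return", which carry no signal.
        return {tok for tok in IDENTIFIER.findall(text) if not keyword.iskeyword(tok)}

    shared = identifiers(reference_patch) & identifiers(model_output)
    return shared - identifiers(prompt)


if __name__ == "__main__":
    # Hypothetical task, patch, and answer, for illustration only.
    prompt = "Task 1234: fix the timeout bug"
    reference = "def _clamp_retry_window(delay):\n    # guard against negative delay\n    return max(delay, 0)"
    answer = "def _clamp_retry_window(delay):\n    return max(delay, 0)"
    print(sorted(leaked_identifiers(prompt, reference, answer)))
```

A distinctive name such as the hypothetical `_clamp_retry_window` showing up in the answer, when the prompt never mentioned it, is the kind of signal described above. Generic names like `delay` also match, so a check like this can flag candidates for human review but cannot prove contamination on its own.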

    OpenAI now recommends SWE-bench Pro, a newer benchmark from Scale AI that uses more diverse codebases and licenses that reduce training data exposure. The performance drop is jarring: models that cleared 70% on the old Verified benchmark score around 23% on SWE-bench Pro’s public split, and even less on its private tasks.

    On the current public SWE-bench Verified leaderboard, OpenAI is far from the benchmark’s podium. Retiring a benchmark where you’re losing and endorsing one where everyone starts at 23% resets the scoreboard at a convenient moment and makes rivals’ claims less impressive.

    That is especially important considering that the much-anticipated new version of DeepSeek, a free, open-source model, is rumored to beat or come extremely close to American AI models, particularly in agentic and coding tasks. That model could be days away from release, and SWE-bench Verified would be a key metric to measure its quality.

    OpenAI said it’s building privately authored evaluations that won’t be released before testing, pointing to its GDPVal project where domain experts write original tasks graded by trained human reviewers.

    The benchmark problem isn’t new, and isn’t unique to coding. AI labs have cycled through multiple evaluations, each useful until models were trained on them or until the tasks proved too narrow.

    But what makes this case notable is that OpenAI hyped SWE-bench Verified, promoted it across model releases, and is now publicly documenting how thoroughly it has failed, including by showing its own model cheating on it.
