In brief
- ARC-AGI-3 exposes an enormous gap between AGI claims and reality, with top AI models scoring below 1% while humans achieve perfect performance.
- The benchmark tests true generalization, requiring agents to explore, plan, and learn from scratch in unknown environments rather than recall trained patterns.
- Despite industry hype, current AI systems remain far from AGI, lacking the reasoning and adaptability that even young humans display naturally.
Nvidia CEO Jensen Huang went on Lex Fridman’s podcast last week and said, plainly, “I think we’ve achieved AGI.” Two days later, the most rigorous test in AI research dropped its newest artificial general intelligence benchmark, and every frontier model scored below 1%.
The ARC Prize Foundation released ARC-AGI-3 this week, and the results are brutal. Google’s Gemini 3.1 Pro led the pack at 0.37%. OpenAI’s GPT-5.4 came in at 0.26%. Anthropic’s Claude Opus 4.6 managed 0.25%, while xAI’s Grok-4.20 scored exactly zero. Humans, meanwhile, solved 100% of the environments.
This isn’t a trivia quiz or a coding exam, or even a set of ultra-hard PhD-level questions. ARC-AGI-3 is something entirely different from anything the AI industry has faced before.
The benchmark was built by François Chollet and Mike Knoop’s foundation, which set up an in-house game studio and created 135 original interactive environments from scratch. The idea is to drop an AI agent into an unfamiliar game-like world with zero instructions, zero stated goals, and no description of the rules. The agent has to explore, figure out what it’s supposed to do, form a plan, and execute it.
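In code terms, the task reduces to a bare interaction loop. The sketch below uses hypothetical interface names (env, step, and so on are illustrative, not ARC’s actual agent API) to show how little the agent is given to work with:

```python
import random

# Hypothetical interface, for illustration only; the real ARC-AGI-3
# agent API is documented by the ARC Prize Foundation.
def run_agent(env, action_space, max_actions=1_000):
    """Try to solve an unknown environment with no instructions or goals."""
    env.reset()
    for _ in range(max_actions):
        # No rules are given, so the agent must infer what matters.
        # This naive baseline just samples actions and watches outcomes;
        # a capable agent would build a model of the game as it explores.
        action = random.choice(action_space)
        observation, done = env.step(action)
        if done:  # the goal is only discovered by reaching it
            return True
    return False
```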
If that sounds like something any five-year-old can do, you’re starting to understand the problem. If you want to see whether you’re better than the AI, you can play the same games featured in the test by clicking this link. We tried one; it was weird at first, but after a few seconds you can easily get the hang of it.

It’s also the clearest example of what the “G” in AGI stands for. When you generalize, you can create new knowledge (how a weird game works) without being trained on it in advance.
Earlier versions of ARC tested static visual puzzles: show a pattern, predict the next one. They were hard at first. Then the labs threw compute and training at them until the benchmarks were effectively dead. ARC-AGI-1, released in 2019, fell to test-time training and reasoning models. ARC-AGI-2 lasted about a year before Gemini 3.1 Pro hit 77.1%. The labs are very good at saturating benchmarks they can train against.
Version 3 was designed specifically to prevent that. With 110 of the 135 environments kept private (55 semi-private for API testing, 55 fully locked for competition), there is no dataset to memorize. You can’t brute-force your way through novel game logic you’ve never seen.
Scoring isn’t pass/fail, either. ARC-AGI-3 uses what the foundation calls RHAE, or Relative Human Action Efficiency. The baseline is the second-best first-run human performance. An AI that takes ten times as many actions as a human scores 1% for that level, not 10%: the formula squares the penalty for inefficiency. Wandering around, backtracking, and guessing your way to an answer gets punished hard.
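Working backwards from that example, the per-level score appears to be the squared ratio of the human baseline’s action count to the agent’s, since (1/10)² = 1%. Here is a minimal sketch under that assumption; the authoritative formula is in the ARC Prize report:

```python
def rhae_level_score(human_actions: int, agent_actions: int) -> float:
    """Per-level score as a squared efficiency ratio (assumed form).

    Matching the human baseline scores 100%; taking ten times as many
    actions scores (1/10) ** 2 = 1%, as in the article's example.
    """
    if agent_actions <= 0:
        return 0.0  # assumed: an unsolved level scores nothing
    efficiency = min(human_actions / agent_actions, 1.0)
    return efficiency ** 2 * 100


# The article's example: ten times as many actions as the human baseline.
print(rhae_level_score(human_actions=50, agent_actions=500))  # 1.0 (%)
```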
The best AI agent in the month-long developer preview scored 12.58%. Frontier LLMs tested through the official API, with no custom tooling, couldn’t crack 1%. Ordinary humans solved all 135 environments with no prior training and no instructions. If that’s the bar, the current crop of models isn’t clearing it.
There is one real methodological debate here. ARC’s report says a Duke-built custom harness pushed Claude Opus 4.6 from 0.25% to 97.1% on a single environment variant called TR87. That doesn’t mean Claude scored 97.1% on ARC-AGI-3 overall; its official benchmark score remained 0.25%, but the jump is still worth noting.
The official benchmark feeds agents JSON, not visuals. That’s either a methodological flaw or a demonstration that today’s models are better at processing human-friendly information than raw structured data. Chollet’s foundation has acknowledged the debate but isn’t changing the format.
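To make the debate concrete, here is the kind of structured observation an agent might receive where a human player sees a rendered screen. The field names and grid encoding are hypothetical, for illustration only:

```python
# Hypothetical frame payload; not ARC's actual schema.
frame = {
    "grid": [
        [0, 0, 3, 0],  # each integer encodes a tile or color
        [0, 3, 3, 0],
        [0, 0, 3, 0],
    ],
    "score": 0,
    "state": "NOT_FINISHED",
}
# A human playing the same game sees these cells rendered as colored
# pixels and can lean on visual intuition; the model gets raw integers.
```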
“Frame content perception and API format are not limiting factors for frontier model performance on ARC-AGI-3,” the paper reads. In other words, they seem to reject the idea that models fail because they “can’t see” the tasks properly, arguing instead that perception is already sufficient and that the real gap lies in reasoning and generalization.
The AGI reality check arrived during a week when the hype machine was running at full speed. Besides Huang’s comment, Arm named its new data center chip the “AGI CPU.” OpenAI’s Sam Altman has said the company has “basically built AGI,” and Microsoft is already marketing a lab focused on building ASI, the step that supposedly comes after AGI is achieved. The term is being stretched until it means whatever is commercially convenient, it seems.
Chollet’s position is simpler: if a normal human with no instructions can do it and your system can’t, you don’t have AGI; you have a very expensive autocomplete that needs a lot of help.
ARC Prize 2026 is offering $2 million across three competition tracks, all hosted on Kaggle. Every winning solution must be open-sourced. The clock is running, and right now the machines aren’t even close.