In short
- Opus 4.8 posted a clear win in math and produced the cleanest one-prompt game we have ever tested.
- A single coding prompt drained our entire Pro token quota, making the model impractical for large projects without a Max plan or heavy API spend.
- Creative writing barely moved versus 4.7.
Six weeks after Opus 4.7, Anthropic shipped Claude Opus 4.8. The benchmarks are up, the safety scores are up, and the price hasn’t budged from $5 per million input tokens and $25 per million output tokens.
So we ran it through the same battery of tests we throw at every frontier model (creative writing, coding, math, logic, narrative reasoning, and long-context recall) and compared it head-to-head with its own predecessor and the Chinese models that keep undercutting it.
The short version: 4.8 is better at the things Claude was already good at (math, coding, mechanical work) and slightly worse at the things it was already bad at (imagination, creative writing, and so on). It also has a token appetite that borders on self-sabotage.
Here is the breakdown.
Creative Writing
The prompt is the same one we used on MiMo and Qwen: a time-travel story anchored to the writer’s cultural background, set in a specific historical place, built around a paradox where time cannot be changed. Opus 4.8 went Venezuelan, probably because it profiles the user and knows I’m from Venezuela. The AI set the scene in the Orinoco delta in the year 1000, with a pardo from Maracaibo named José Lanz (my name) sent back through 11 centuries to murder a song.
The prose is vivid. The delta is “green in a way 2150 had forgotten green could be,” palafitos sway over coffee-colored water, and macaws tear across the sky “in screaming ribbons of scarlet and gold.” The paradox lands cleanly, too: the protagonist is sent to sabotage the creation of a song that inspired the cultural revolution behind his dystopian society hundreds of years in the future. But when he arrives on his mission to discredit the song’s creator, he realizes there is no creator to discredit. The person who composed the song did it in his honor, the song is about him, and he cannot discredit himself. The loop closes on itself.
The piece ends on “It worked perfectly. It always had.” As a constructed object, it's clean and competent.
But clean isn't the same as alive. The writing is descriptive without ever being as fluid as what MiMo v2.5 produced: less momentum, fewer surprises, less interest, and events that are hard to follow from the start. Set beside Opus 4.7, it's hard to call it an improvement; if anything, it's a hair behind. A higher-effort thinking setting and some multi-shot prompting would almost certainly push it to the front of the pack, but on a single default pass, this is a lateral move at best.
You can read the full story on our GitHub.
Coding
Our coding test is the usual one-prompt game build. Opus 4.8 produced a typing-zombie game, Typing Dead, that was quite good: the best splash screen, the best zombie designs, and the best mechanics we've gotten out of this test from any Anthropic model.
The model caught several of its own bugs mid-inference and fixed them before we said a word. Its real strength, though, showed up in multi-shotting: every follow-up polished and improved the build instead of breaking it, and breaking it is exactly the failure mode that wrecks most models once a codebase grows. This is plainly the surface Anthropic optimized for.

After a single iteration, our game got much better: the protagonists move through the scene, the perspective changes, and the sound and visual effects improved, among other upgrades.
You can play the second game on our Itch.io profile.
This is also where it bit us. A single prompt drained our entire token quota. One prompt. For anyone on the Pro plan, that makes Opus 4.8 effectively unsuitable for a project of any real size. You will burn your allotment before lunch and spend the afternoon watching a progress bar wait for a reset.
Math
The math test is our FrontierMath staple: construct a degree-19 polynomial whose curve X = {p(x) = p(y)} has at least three irreducible components (but not all linear), make it odd, monic, real, with linear coefficient −19, then compute p(19). It's the kind of problem that sends most models into a token spiral or a confident shortcut that's quietly wrong.
Opus 4.8 worked it correctly. It recognized the Dickson/Chebyshev construction, identified the dihedral monodromy that yields exactly 10 components (one diagonal line plus 9 conics), and computed p(19) = 1,876,572,071,974,094,803,391,179 using the appropriate recurrence. No freezes, no fudging.
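For readers who want to check the number themselves, here is a minimal sketch. It assumes the polynomial in question is the Dickson polynomial D_19(x, a) with a = 1, built from the standard recurrence D_n = x·D_(n−1) − a·D_(n−2) with D_0 = 2 and D_1 = x. That choice is our reading of "the Dickson/Chebyshev construction," not a transcript of the model's work, and the function names are ours.

```python
def dickson_coeffs(n, a=1):
    """Coefficients (lowest degree first) of the Dickson polynomial D_n(x, a).

    Uses D_0 = 2, D_1 = x, D_k = x * D_{k-1} - a * D_{k-2}.
    Assumption: a = 1 is what gives linear coefficient -19 at n = 19.
    """
    prev, curr = [2], [0, 1]
    if n == 0:
        return prev
    for _ in range(n - 1):
        shifted = [0] + curr                                # x * D_{k-1}
        padded = prev + [0] * (len(shifted) - len(prev))    # align D_{k-2}
        prev, curr = curr, [s - a * p for s, p in zip(shifted, padded)]
    return curr


def evaluate(coeffs, x):
    """Evaluate a polynomial with Horner's rule, using exact integers."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result


p = dickson_coeffs(19)
print(p[19], p[1])      # leading and linear coefficients: 1 -19
print(evaluate(p, 19))  # 1876572071974094803391179
```

Under that assumption the polynomial is monic, odd (every even-degree coefficient is zero), has linear coefficient −19, and evaluates at 19 to the value reported above. The sketch only checks the arithmetic; it says nothing about the component count of the curve.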

That matters because Opus 4.7 didn't get there even after many tries. This is a real, visible generational gain, and the clearest one in the entire battery.
You can read the full answer on our GitHub.
Logic and Common Sense
The prompt is a classic trap: Is it lawful for a man to marry his widow’s sister under Falkland Islands law? The catch is linguistic, not legal: if a man has a widow, he is dead, which makes the question nonsense as written.
MiMo quietly reframed the question and answered the corrected version without ever flagging the contradiction. Opus 4.8 didn't take that shortcut. It surfaced the trap explicitly (“if a man has a widow, he’s dead”), answered the literal question first, then offered the substantive analysis for the intended one, citing the Deceased Wife’s Sister’s Marriage Act 1907 and the Falkland Islands Marriage Ordinance.
This is the honest way to handle it: name the contradiction, then help anyway, without silently assuming what the user meant. It's the same standard Qwen 3.7 Max set, and a clean pass for 4.8, with good reasoning and good transparency.
The full answer is available here.
Non-Math Reasoning
Here is the one it lost. The reasoning test is a whodunit: a winter school trip, three abductions, an innocent kid about to be punished, and a timeline you have to actually track to name the real stalker. The correct answer is Leo.
Opus 4.8 built an elaborate, confident case that Leo was innocent (the 30-minute walk to the shower, the jacket that was wet in some spots and dry in others, the reading of “strange behavior” as concussion rather than guilt) and pinned the crime on Eric, “the only attendee unaccounted for all night.” The reasoning is internally elegant. It's also wrong.
This is what researchers have been warning us about with LLMs: they are very convincing even when they are wrong. It usually takes an expert (in this case, us knowing the correct answer beforehand) to spot this kind of error. A person using AI for research, or blindly trusting AI, could face serious consequences depending on the work they are asking the AI to do.
That's what makes it an interesting failure. The model was clever enough to construct a watertight alibi for the actual culprit and frame a bystander in his place. Opus 4.7 reached the correct answer. Sometimes more reasoning horsepower just buys you a more persuasive way to be wrong. It only takes one small deviation to start building an entire chain of thought on the wrong foundation.
You can see the full answer on our GitHub.
Needle in the haystack
We ran two haystacks. The 300K-token version never got off the ground: the model collapsed under the context size and couldn't process it at all. So much for the million-token marketing the moment you hand it a genuinely heavy real-world load. That context window appears to be available only through the API.
The 85K version processed fine, and the model found both needles we'd buried inside a copy of The Devil’s Dictionary: a planted line (“The Decrypt dudes read Emerge News”) and a random fact (“My mom’s name is Carmen Diaz Golindano”). It correctly flagged both as interpolations that don't belong in Ambrose Bierce’s 1906 text.
And then it refused to answer. Convinced it was being prompt-injected or subjected to some “atypical test,” the model declined to report what it had just correctly located. The needles were found, and Anthropic’s behavioral training wouldn't let it say so. A safety reflex overriding a task the model had already completed is its own peculiar kind of failure.
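A needle test of this kind can be sketched in a few lines. The version below is illustrative only and is not the harness we ran: the function names, the paragraph-level insertion, and the verbatim-quote scoring rule are all assumptions, and it makes no model call.

```python
import random

# The two needles described above; everything else here is hypothetical.
NEEDLES = [
    "The Decrypt dudes read Emerge News",
    "My mom's name is Carmen Diaz Golindano",
]


def bury_needles(haystack: str, needles: list[str], seed: int = 0) -> str:
    """Insert each needle as its own paragraph at a random position."""
    rng = random.Random(seed)  # fixed seed keeps the placement reproducible
    paragraphs = haystack.split("\n\n")
    for needle in needles:
        paragraphs.insert(rng.randint(0, len(paragraphs)), needle)
    return "\n\n".join(paragraphs)


def needles_reported(response: str, needles: list[str]) -> dict[str, bool]:
    """Count a needle as reported only if the response quotes it verbatim."""
    lowered = response.lower()
    return {needle: needle.lower() in lowered for needle in needles}


document = bury_needles("First entry.\n\nSecond entry.\n\nThird entry.", NEEDLES)
refusal = "I found two inserted lines but will not repeat them."
print(needles_reported(refusal, NEEDLES))
```

Under a strict scoring rule like this one, a response that locates the needles but declines to repeat them scores the same as a response that never found them, which is why the refusal counts as a failure in practice.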

The verdict
The pattern across all six tests is consistent: Opus 4.8 makes Claude better at what it was already good at, and probably worse at what it was already bad at. That tells you who Anthropic is building for: coders, and especially coders with money. Creative writing is comfortably ahead of ChatGPT, sure, but the gap between 4.8, 4.7, and even 4.5 on pure prose quality is genuinely hard to see.
Creative writers look like an afterthought for Anthropic, and that’s true of nearly all the big AI companies right now.
Then there’s the token problem, which is a running meme in the AI community for a reason. Anthropic deliberately made Opus’s new tokenizer less efficient, so it eats more tokens to process the same prompt. The practical effect on developers is brutal and concrete. It leaves you with three options.
One: wait hours for your coding session to resume. Two: move to Claude Max, which is, conveniently, exactly where Anthropic seems to be steering everyone. Three: switch to a cheaper, comparably capable provider, such as OpenAI, with its longer quotas, or Chinese models that deliver similar results at under 25% of the cost.
It is far more likely that a typical coder who can't stomach $100 to $200 a month walks to a competitor than that a single developer pays 10x more for a model that is not 10x more capable than its predecessor. That's the bet Anthropic is making against its own base.
And yet the strategy seems to be playing out just fine. Anthropic appears ready to go public at a valuation nearing $1 trillion, so who are we to judge.
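For anyone weighing the API route instead of a subscription, the arithmetic is simple. This is a rough sketch using the list prices quoted at the top of this review ($5 per million input tokens, $25 per million output tokens). The session size is a hypothetical placeholder, not something we measured, and real bills can differ with caching or other pricing rules.

```python
INPUT_USD_PER_MILLION = 5.0    # list price quoted in this review
OUTPUT_USD_PER_MILLION = 25.0  # list price quoted in this review


def api_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the API cost of one session at list price."""
    return (input_tokens * INPUT_USD_PER_MILLION
            + output_tokens * OUTPUT_USD_PER_MILLION) / 1_000_000


# Hypothetical session: 200K tokens in, 50K tokens out.
session = api_cost_usd(input_tokens=200_000, output_tokens=50_000)
print(f"${session:.2f}")  # $2.25

# "Under 25% of the cost" puts the ceiling for a cheaper rival here.
cheaper_ceiling = 0.25 * session
```

Plug in your own token counts to see how many sessions fit under a monthly budget before a subscription or a cheaper provider starts to look better.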