Anthropic Claude 4 Review: Creative Genius Trapped by Old Limitations - Decrypt

San Francisco-based Anthropic just dropped the fourth generation of its Claude AI models, and the results are… complicated. While Google pushes context windows past a million tokens and OpenAI builds multimodal systems that see, hear, and speak, Anthropic stuck with the same 200,000-token limit and text-only approach. It’s now the odd one out among major AI companies.

The timing feels deliberate: Google announced Gemini this week too, and OpenAI unveiled a new coding agent based on its proprietary Codex model. Claude’s answer? Hybrid models that shift between reasoning and non-reasoning modes depending on what you throw at them, delivering what OpenAI expects to bring whenever it launches GPT-5.

But here’s something for API users to seriously consider: Anthropic is charging premium prices for that upgrade.

The chatbot app, however, remains the same at $20, with Claude Max priced at $200 a month for 20x higher usage limits.

We put the new models through their paces across creative writing, coding, math, and reasoning tasks. The results tell an interesting story, with marginal improvements in some areas, surprising improvement in others, and a clear shift in Anthropic’s priorities away from general use toward developer-focused features.

Here is how both Claude Sonnet 4 and Claude Opus 4 performed in our different tests. (You can check them out, including our prompts and results, in our GitHub repository.)

Creative writing

Creative writing capabilities determine whether AI models can produce engaging narratives, maintain consistent tone, and integrate factual elements naturally. These skills matter for content creators, marketers, and anyone needing AI assistance with storytelling or persuasive writing.

As of now, there is no model that can beat Claude in this subjective test (not considering Longwriter, of course). So it makes no sense to compare Claude against third-party options. For this task, we decided to put Sonnet and Opus face-to-face.

We asked the models to write a short story about a person who travels back in time to prevent a catastrophe but ends up realizing that their actions in the past were actually part of the events that pushed existence toward that specific future. The prompt added some details to consider and gave the models enough liberty and creativity to set up a story as they saw fit.

Claude Sonnet 4 produced vivid prose with the best atmospheric details and psychological nuance. The model crafted immersive descriptions and provided a compelling story, though the ending was not exactly as requested. Still, it fit the narrative and the expected outcome.

Overall, Sonnet’s narrative construction balanced action, introspection, and philosophical insights about historical inevitability.

Score: 9/10, definitely better than Claude 3.7 Sonnet

Claude Opus 4 grounded its speculative fiction in credible historical contexts, referencing indigenous worldviews and pre-colonial Tupi society with careful attention to cultural limitations. The model integrated source material naturally and provided a longer story than Sonnet, though unfortunately without matching its poetic flair.

It also showed something interesting: The narrative started a lot more vividly and was more immersive than what Sonnet provided, but somewhere around the middle, it shifted to rush the plot twist, making the whole result boring and predictable.

Score: 8/10

Sonnet 4 is the winner for creative writing, though the margin remained narrow. Writers, beware: Unlike with previous models, it appears that Anthropic hasn’t prioritized creative writing improvements, focusing development efforts elsewhere.

All the stories are available here.

Coding

Coding evaluation measures whether AI can generate functional, maintainable software that follows best practices. This capability affects developers using AI for code generation, debugging, and architectural decisions.

Gemini 2.5 Pro is considered the king of AI-powered coding, so we tested it against Claude Opus 4 with extended thinking.

We zero-shot our instructions for a game (a robot that must avoid journalists on its way to merge with a computer and achieve AGI) and used one additional iteration to fix bugs and clarify different aspects of the game.

Claude Opus created a top-down stealth game with sophisticated mechanics, including dynamic sound waves, investigative AI states, and vision cone occlusion. The implementation featured rich gameplay elements: journalists responded to sounds through heardSound flags, obstacles blocked line-of-sight calculations, and procedural generation created unique levels each playthrough.

Score: 8/10

Google’s Gemini produced a side-scrolling platformer with cleaner architecture using ES6 classes and named constants.

The game was not functional after two iterations, but the implementation separated concerns effectively: stage.init() handled terrain generation, the Journalist class encapsulated patrol logic, and constants like PLAYER_JUMP_POWER enabled easy tuning. While gameplay remained simpler than Claude’s version, the maintainable structure and consistent coding standards earned notably high marks for readability and maintainability.

Verdict: Claude won. It delivered superior gameplay functionality that users would prefer.

However, developers might prefer Gemini despite all this, as it created cleaner code that can be improved more easily.

Our prompt and code are available here. And you can click here to play the game generated with Claude.

Mathematical reasoning

Mathematical problem-solving tests AI models’ ability to handle complex calculations, show reasoning steps, and arrive at correct answers. This matters for educational applications, scientific research, and any domain requiring precise computational thinking.

We compared Claude and OpenAI’s latest reasoning model, o3, asking the models to solve a problem that appeared on the FrontierMath benchmark, which is designed specifically to be hard for models to solve:

“Construct a degree 19 polynomial p(x) ∈ C[x] such that X := {p(x) = p(y)} ⊂ P¹ × P¹ has at least 3 (but not all linear) irreducible components over C. Choose p(x) to be odd, monic, have real coefficients and linear coefficient -19 and calculate p(19).”

Claude Opus 4 displayed its full reasoning process when tackling difficult mathematical challenges. The transparency allowed evaluators to trace logic paths and identify where calculations went wrong. Despite showing all the work, the model failed to achieve perfect accuracy.

OpenAI’s o3 model achieved 100% accuracy on identical mathematical tasks, marking the first time any model solved the test problems completely. However, o3 truncated its reasoning display, showing only final answers without intermediate steps. This approach prevented error analysis and made it impossible for users to verify the logic or learn from the solution process.

Verdict: OpenAI o3 won the mathematical reasoning category through perfect accuracy, though Claude’s transparent approach offered educational advantages. For example, researchers will have an easier time catching failures while analyzing the full Chain of Thought, instead of having to either fully trust the model or solve the problem manually to corroborate results.

You can check Claude 4’s Chain of Thought here.

Non-mathematical reasoning and communication

For this evaluation, we wanted to test the models’ ability to understand complexities, craft nuanced messages, and balance interests. These skills prove essential for business strategy, public relations, and any scenario requiring sophisticated human communication.

We gave Claude, Grok, and ChatGPT instructions to craft a single communication strategy that simultaneously addresses five different stakeholder groups about a critical situation at a large medical center. Each group has vastly different perspectives, emotional states, information needs, and communication preferences.

Claude demonstrated exceptional strategic thinking through a three-pillar messaging framework for a hospital ransomware crisis: Patient Safety First, Active Response, and Stronger Future. The response included specific resource allocations of $2.3 million in emergency funding, detailed timelines for each stakeholder group, and culturally sensitive adaptations for multilingual populations. Individual board member concerns received tailored attention while maintaining message consistency. The model provided a good set of opening statements to give an idea of how to approach each audience.

ChatGPT was also good at the task, but not at the same level of detail and practicality. While providing solid frameworks with clear core principles, GPT-4.1 relied more on tone variation than substantive content adaptation. The responses were extensive and detailed, anticipating questions and moods, and how our actions might impact those being addressed. However, it lacked specific resource allocations, detailed deliverables, and other details that Claude provided.

Verdict: Claude wins

You can check the results and Chain of Thought for each model here.

Needle in the haystack

Context retrieval capabilities determine how effectively AI models can locate specific information within lengthy documents or conversations. This skill proves critical for legal research, document analysis, academic literature reviews, and any scenario requiring precise information extraction from large text volumes.

We tested Claude’s ability to identify specific information buried within progressively larger context windows using the standard “needle in a haystack” methodology. This evaluation involved placing a targeted piece of information at various positions within documents of varying lengths and measuring retrieval accuracy.
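For readers who want to try something similar, the methodology can be sketched roughly as follows. This is a minimal illustration, not the harness we used: `build_haystack`, `retrieval_accuracy`, and the `ask_model` callable are hypothetical names, and the substring check is a deliberately crude way to grade an answer.

```python
from typing import Callable, Sequence


def build_haystack(paragraphs: Sequence[str], needle: str, depth: float) -> str:
    """Insert `needle` among filler paragraphs at a fractional depth.

    depth = 0.0 puts the needle first, 1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = int(depth * len(paragraphs))
    parts = list(paragraphs[:index]) + [needle] + list(paragraphs[index:])
    return "\n\n".join(parts)


def retrieval_accuracy(
    ask_model: Callable[[str, str], str],  # hypothetical: (document, question) -> answer
    paragraphs: Sequence[str],
    needle: str,
    question: str,
    expected: str,
    depths: Sequence[float],
) -> float:
    """Return the fraction of needle positions at which the answer contains `expected`."""
    hits = 0
    for depth in depths:
        document = build_haystack(paragraphs, needle, depth)
        answer = ask_model(document, question)
        if expected.lower() in answer.lower():
            hits += 1
    return hits / len(depths)
```

In practice, `ask_model` would wrap a call to whichever chatbot or API is being tested, and the filler would be scaled up until the document reaches the target token count.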

Claude Sonnet 4 and Opus 4 successfully identified the needle when embedded within an 85,000-token haystack. The models demonstrated reliable retrieval capabilities across different placement positions within this context range, maintaining accuracy whether the target information appeared at the beginning, middle, or end of the document. Response quality remained consistent, with the models providing precise citations and relevant context around the retrieved information.

However, the models’ performance hit a hard limitation when attempting to process the 200,000-token haystack test. They could not complete this evaluation because the document size exceeded their maximum context window capacity of 200,000 tokens. This is a significant constraint compared to competitors like Google’s Gemini, which handles context windows exceeding a million tokens, and OpenAI’s models with significantly larger processing capabilities.

This limitation has practical implications for users working with extensive documentation. Legal professionals analyzing lengthy contracts, researchers processing comprehensive academic papers, or analysts reviewing detailed financial reports may find Claude’s context restrictions problematic. The inability to process the full 200,000-token test suggests that real-world documents approaching this size may trigger truncation or require manual segmentation.
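As a rough illustration of what manual segmentation can look like, the sketch below greedily groups paragraphs into chunks that fit a token budget. It is an assumption-laden example: `split_to_budget` is a hypothetical helper, and the four-characters-per-token figure is only a crude rule of thumb for English text, not how any particular model counts tokens.

```python
import math
from typing import List, Sequence


def split_to_budget(
    paragraphs: Sequence[str],
    max_tokens: int,
    chars_per_token: float = 4.0,  # assumption: rough heuristic, not a real tokenizer
) -> List[List[str]]:
    """Greedily group paragraphs into chunks whose estimated size fits `max_tokens`.

    A single paragraph larger than the budget still gets its own (oversized)
    chunk, so callers may want to split such paragraphs further.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for paragraph in paragraphs:
        cost = math.ceil(len(paragraph) / chars_per_token)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + cost > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += cost
    if current:
        chunks.append(current)
    return chunks
```

Leaving headroom below the advertised limit is sensible, since the prompt and the model’s reply also count against the context window.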

Verdict: Gemini is the better model for long context tasks

You can check both the needle and the haystack here.

Conclusion

Claude 4 is great, and better than ever, but it’s not for everyone.

Power users who need its creativity and coding capabilities will be very pleased. Its understanding of human dynamics also makes it ideal for business strategists, communications professionals, and anyone needing sophisticated analysis of multi-stakeholder scenarios. The model’s transparent reasoning process also benefits educators and researchers who need to understand AI decision-making paths.

However, novice users wanting the full AI experience may find the chatbot a little lackluster. It doesn’t generate video, you cannot talk to it, and the interface is less polished than what you can find in Gemini or ChatGPT.

The 200,000-token context window limitation affects Claude users processing lengthy documents or maintaining extended conversations, and Claude also enforces a very strict quota that may affect users expecting long sessions.

In our opinion, it is a solid “yes” for creative writers and vibe coders. Other types of users may need some consideration, weighing pros and cons against alternatives.

Edited by Andrew Hayward
