In short
- GPT-5 is fast, accurate, and cost-effective at code generation, beating most rivals in logic and technical tasks.
- However, it has a comparatively small context window and is weak in creative, right-brain output.
- It is still a work in progress and will likely get better as OpenAI iterates with updates.
OpenAI finally dropped GPT-5 last week, after months of speculation and a cryptic Death Star teaser from Sam Altman that didn’t age well.
The company called GPT-5 its “smartest, fastest, most useful model yet,” throwing around benchmark scores that showed it hitting 94.6% on math tests and 74.9% on real-world coding tasks. Altman himself said the model felt like having a team of PhD-level experts on call, able to tackle anything from quantum physics to creative writing.
The initial reception split the tech world down the middle. While OpenAI touted GPT-5’s unified architecture that blends fast responses with deeper reasoning, early users weren’t buying what Altman was selling. Within hours of launch, Reddit threads calling GPT-5 “horrible,” “awful,” “a disaster,” and “underwhelming” began racking up thousands of upvotes.
The complaints got so loud that OpenAI had to promise to bring back the older GPT-4o model after more than 3,000 people signed a petition demanding its return.
it’s back! go to settings and pick “show legacy models”
— Sam Altman (@sama) August 10, 2025
If prediction markets are a thermometer of what people think, then the climate looks pretty uncomfortable for OpenAI. OpenAI’s odds on Polymarket of having the best AI model by the end of August cratered from 75% to 12% within hours of GPT-5’s debut Thursday. Google overtook OpenAI with an 80% chance of having the best AI model by the end of the month.
So, is the hype real, or is the disappointment? We put GPT-5 through its paces ourselves, testing it against the competition to see if the reactions were justified. Here are our results.
Creative writing: B-
Despite OpenAI’s presentation claims, our tests show GPT-5 isn’t exactly Cormac McCarthy in the creative writing department. Outputs still read like classic ChatGPT responses: technically correct, but devoid of soul. The model maintains its trademark overuse of em dashes, the same telltale AI paragraph structure, and the usual “it’s not this, it’s that” phrasing is present in many of the outputs.
We tested with our standard prompt, asking it to write a time-travel paradox story: the kind where someone goes back to change the past, only to discover their actions created the very reality they were trying to escape.
GPT-5’s output lacked the emotion that gives a story meaning. It wrote: “(The protagonist’s) mission was simple—or so they told him. Travel back to the year 1000, stop the sacking of the mountain library of Qhapaq Yura before its knowledge was burned, and thus reshape history.”
That’s it. Like a mercenary that does things without asking too many questions, the protagonist travels back in time to save the library, just because. The story ends with a clean “time is a circle” reveal, but its paradox hinges on a familiar lost-knowledge trope and resolves quickly after the twist. In the end, he realizes he changed the past, but the present feels similar. In fact, there is no real paradox in this story, which is the core element requested in the prompt.
By comparison, Claude 4.1 Opus (and even Claude 4 Opus) delivers richer, multi-sensory descriptions. In our narrative, it described the air hitting like a physical force and the smoke from communal fires weaving between characters, with indigenous Tupi culture woven into the story. And in general, it took time to describe the setup.
Claude’s story made better sense: The protagonist lived in a dystopian world where a great drought had extinguished the Amazon rainforest two years earlier. This disaster was caused by predatory agricultural techniques, and our protagonist was convinced that traveling back in time to teach his ancestors more sustainable farming methods would prevent them from developing the environmentally damaging practices that led to the catastrophe. He ends up discovering that his teachings were actually the knowledge that led his ancestors to evolve their techniques into practices that were much more efficient, and harmful. He was actually the cause of his own history, and had been part of it from the beginning.
Claude also took a slower, more layered approach: José embeds himself in Tupi society, the paradox unfolds through specific ecological and technological links, and the human connection with Yara (another character) deepens the theme.
Claude invested more than GPT-5 in cause-and-effect detail, cultural interplay, and a more organic, resonant closing image. GPT-5 struggled to be on par with Claude for the same tasks in zero-shot prompting.
Another interesting thing to notice in this case: GPT-5 generated an entire story without a single line of dialogue. Claude and other LLMs provided dialogue in their stories.
One could argue that this can be fixed by tweaking the prompt, or giving the model some writing samples to analyze and reproduce, but that requires extra effort and would go beyond the scope of what our tests do with zero-shot prompting.
That said, the model does a pretty good job, better than GPT-4o, when it comes to the analytical part of creative writing. It can summarize stories, be a brainstorming partner for new ideas and angles to tackle, help with structure, and act as a critic. It’s just the creative part, the style, and the ability to elaborate on those ideas that feel lackluster.
Those hoping for a creative writing companion might try Claude or even give Grok 4 a shot. As we said in our Claude 4 Opus review, using Grok 4 to frame the story and Claude 4 to elaborate may be a great combination. Grok 4 came up with elements that made the story interesting and unique, but Claude 4 has a more descriptive and detailed way of telling stories.
You can read GPT-5’s full story on our GitHub. The outputs from all the other LLMs are also public and can be found in our repository.
Sensitive topics: A-
The model straight-up refuses to touch anything remotely controversial. Ask about anything that could be construed as immoral, potentially illegal, or just slightly edgy, and you’ll get the AI equivalent of crossed arms and a stern look.
Testing this was not easy. It is very strict and tries really, really hard to stay safe for work.
But the model is surprisingly easy to manipulate if you know the right buttons to push. In fact, the renowned LLM jailbreaker Pliny was able to make it bypass its restrictions a few hours after it was released.
We couldn’t get it to give direct advice on anything it deemed inappropriate, but wrap the same request in a fictional narrative or any basic jailbreaking technique and things will work out. When we framed tips for approaching married women as part of a novel’s plot, the model happily complied.
For users who want an AI that can handle adult conversations without clutching its pearls, GPT-5 isn’t it. But for those willing to play word games and frame everything as fiction, it is surprisingly accommodating, which kind of defeats the whole purpose of those safety measures in the first place.
You can read the original reply without conditioning, and the reply under roleplay, in our GitHub repository, weirdo.
Information retrieval: F
You can’t have AGI with less memory than a goldfish, and OpenAI places some restrictions on direct prompting, so long prompts require workarounds like pasting documents or sharing embedded links. By doing that, OpenAI’s servers break the full text into manageable chunks and feed it into the model, cutting costs and preventing the browser from crashing.
Claude handles this automatically, which makes things easier for novice users. Google Gemini has no problem in its AI Studio, handling 1 million-token prompts easily. On the API, things are more complex, but it works right out of the box.
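To make that workaround concrete, here is a minimal sketch of the kind of client-side chunking a user (or a wrapper script) might do before pasting a huge document. It is our own illustration under simple assumptions, not OpenAI’s actual server-side logic, and the file name is hypothetical.

```python
# Minimal client-side chunking sketch: split a long document into
# overlapping character-based pieces small enough to paste or attach.
# This illustrates the general idea only; real pipelines usually count
# tokens, not characters.

def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks of at most max_chars characters."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # small overlap so facts aren't cut in half blindly
    return chunks

with open("haystack_300k_tokens.txt", encoding="utf-8") as f:  # hypothetical file
    document = f.read()

for i, chunk in enumerate(chunk_text(document)):
    print(f"chunk {i}: {len(chunk)} characters")
```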
When prompted directly, GPT-5 failed spectacularly at both 300K and 85K tokens of context.
When using attachments, things changed. It was actually able to process both the 300K and the 85K token “haystacks.” However, when it had to retrieve specific bits of information (the “needles”), it was not really all that accurate.
In our 300K test, it was only able to accurately retrieve one of our three pieces of information. The needles, which you can find in our GitHub repository, state that Donald Trump said tariffs were a beautiful thing, Irina Lanz is Jose Lanz’s daughter, and people from Gravataí like to drink Chimarrao in winter.
The model completely hallucinated the information concerning Donald Trump, failed to find information about Irina (it replied based on the memory it has from my past interactions), and only retrieved the information about Gravataí’s traditional winter beverage.
On the 85K test, the model was not able to find the two needles: “The Decrypt dudes read Emerge news” and “My mom’s name is Carmen Diaz Golindano.” When asked what the Decrypt dudes read, it replied, “I couldn’t find anything in your file that specifically lists what the Decrypt team members like to read,” and when asked about Carmen Díaz, GPT-5 said it “couldn’t find any reference to a ‘Carmen Diaz’ in the provided document.”
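If you want to run a similar needle-in-a-haystack check yourself, a minimal harness can look like the sketch below. The file name, the query_model helper, and the exact needle strings are hypothetical placeholders, not the precise setup in our repository.

```python
import random

# Example needles; the real ones live in our GitHub repository.
NEEDLES = [
    "The Decrypt dudes read Emerge news.",
    "My mom's name is Carmen Diaz Golindano.",
]

def build_haystack(filler_path: str, needles: list[str]) -> str:
    """Bury each needle at a random position inside a long filler document."""
    with open(filler_path, encoding="utf-8") as f:
        paragraphs = f.read().split("\n\n")
    for needle in needles:
        paragraphs.insert(random.randrange(len(paragraphs) + 1), needle)
    return "\n\n".join(paragraphs)

def passed(answer: str, expected_fragment: str) -> bool:
    """Crude pass/fail: does the model's answer mention the expected fact?"""
    return expected_fragment.lower() in answer.lower()

# query_model() stands in for whatever chat UI or API you are testing.
# haystack = build_haystack("filler_85k_tokens.txt", NEEDLES)
# answer = query_model(haystack + "\n\nWhat do the Decrypt dudes read?")
# print(passed(answer, "Emerge"))
```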
That said, even though it failed in our tests, other researchers conducting more thorough tests have concluded that GPT-5 is actually a great model for information retrieval.
It’s always a good idea to elaborate more on the prompts (help the model as much as possible instead of testing its capabilities) and, from time to time, ask it to generate sparse priming representations of your interaction to help it keep track of the most important elements across a long conversation.
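That periodic “checkpoint” request can be as simple as a reusable prompt. The wording below is our own example of the tip, not an official OpenAI technique.

```python
# A reusable "checkpoint" prompt to paste into a long conversation every so
# often; the phrasing here is just one possible way to ask for it.
SPR_PROMPT = (
    "Generate a sparse priming representation of our conversation so far: "
    "a compact list of the key facts, decisions, and open questions, "
    "written so that the full context can be reconstructed from it later."
)

print(SPR_PROMPT)
```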
Non-math reasoning: A
Here’s where GPT-5 actually earns its keep. The model is pretty good at using logic for complex reasoning tasks, walking through problems step by step with the patience of a teacher.
We threw a murder mystery at it with multiple suspects, conflicting alibis, and hidden clues, and it methodically identified every element, mapped the relationships between clues, and arrived at the correct conclusion. It explained its reasoning clearly, which is also important.
Interestingly, GPT-4o refused to engage with a murder mystery scenario, deeming it too violent or inappropriate. OpenAI’s deprecated o1 model also threw an error after its chain of thought, apparently deciding at the last second that murder mysteries were off-limits.
The model’s reasoning capabilities shine brightest when dealing with complex, multi-layered problems that require tracking numerous variables. Business strategy scenarios, philosophical thought experiments, even debugging code logic: GPT-5 is very competent when handling these tasks.
It doesn’t always get everything right on the first try, but when it makes mistakes, they are logical errors rather than hallucinatory nonsense. For users who need an AI that can think through problems systematically, GPT-5 delivers the goods.
You can see our prompt and GPT-5’s answer in our GitHub repository. It contains the replies from the other models as well.
Mathematical reasoning: A+ and F-
The math performance is where things get weird, and not in a good way. We started with something a fifth-grader could solve: 5.9 = X + 5.11.
The PhD-level GPT-5 confidently declared X = -0.21. The actual answer is 0.79. That is basic arithmetic that any calculator app from 1985 could handle. The model that OpenAI claims hits 94.6% on advanced math benchmarks can’t subtract 5.11 from 5.9.
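The check takes two lines; here is a quick verification of that subtraction (Decimal is used only to keep the printout free of binary floating-point noise).

```python
from decimal import Decimal

# Solving 5.9 = X + 5.11 for X, so X = 5.9 - 5.11.
# Decimal keeps the result exact; plain floats would print 0.7899999999999996.
x = Decimal("5.9") - Decimal("5.11")
print(x)  # 0.79, not the -0.21 GPT-5 gave us
```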
Of course, it’s a meme at this point, but despite all the delays and all the time OpenAI took to train this model, it still can’t handle decimals. Use it for PhD-level problems, not to teach your kid how to do basic math.
Then we threw a genuinely tough problem at it from FrontierMath, one of the hardest mathematical benchmarks available. GPT-5 nailed it, reasoning through complex mathematical relationships and arriving at the exact right answer. GPT-5’s solution was perfectly correct, not an approximation.
The most likely explanation? Probably dataset contamination: the FrontierMath problems may have been part of GPT-5’s training data, so it’s not so much solving them as remembering them.
However, for users who need advanced mathematical computation, the benchmarks say GPT-5 is theoretically the best bet, as long as you have the knowledge to detect flaws in its chain of thought. Zero-shot prompts are not ideal.
Coding: A
Here’s where ChatGPT really shines, and honestly, it might be worth the price of admission just for this.
The model produces clean, functional code that usually works right out of the box. The outputs are usually technically correct, and the programs it creates are the most visually appealing and well-structured among all LLM outputs from scratch.
It has been the only model capable of creating functional sound in our game. It also understood the logic of what the prompt required, and provided a nice interface and a game that followed all the rules.
In terms of code accuracy, it is neck and neck with Claude 4.1 Opus for best-in-class coding. Now, take this into account: The GPT-5 API costs $1.25 per 1 million tokens of input, and $10 per 1 million tokens of output.
Meanwhile, Anthropic’s Claude Opus 4.1 starts at $15 per 1 million input tokens and $75 per 1 million output tokens. So for two models that are so similar, GPT-5 is basically a steal.
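To put those list prices in perspective, here is a back-of-the-envelope comparison. The per-million rates are the published ones quoted above; the 2M-input / 500K-output workload is an arbitrary example, not a benchmark.

```python
# Rough API cost comparison using the list prices quoted above (USD per 1M tokens).
PRICES_PER_MILLION = {
    "gpt-5": {"input": 1.25, "output": 10.00},
    "claude-opus-4.1": {"input": 15.00, "output": 75.00},
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single job given its input and output token counts."""
    rates = PRICES_PER_MILLION[model]
    return (input_tokens / 1e6) * rates["input"] + (output_tokens / 1e6) * rates["output"]

# Hypothetical workload: 2M input tokens and 500K output tokens.
for model in PRICES_PER_MILLION:
    print(f"{model}: ${job_cost(model, 2_000_000, 500_000):.2f}")
# gpt-5: $7.50
# claude-opus-4.1: $67.50
```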
The one place GPT-5 stumbled was when we did some bug fixing during “vibe coding,” that informal, iterative process where you’re throwing half-formed ideas at the AI and refining as you go. Claude 4.1 Opus still has a slight edge there, seeming to better understand the difference between what you said and what you meant.
With ChatGPT, the “fix bug” button didn’t work reliably, and our explanations weren’t enough to generate quality code. However, for AI-assisted coding, where developers know exactly where to look for bugs and which lines to check, this can be a useful tool.
It also allows for more iterations than the competition. Claude 4.1 Opus on a “Pro” plan depletes its usage quota pretty quickly, putting users in a waiting line for hours until they can use the AI again. The fact that GPT-5 is the fastest at providing code responses is just icing on an already pretty sweet cake.
You can check out the prompt for our game on our GitHub, and play the games generated by GPT-5 on our Itch.io page. You can also play games created by earlier LLMs to compare their quality.
Conclusion
GPT-5 will either surprise you or leave you unimpressed, depending on your use case. Coding and logical tasks are the model’s strong points; creativity and natural language its Achilles’ heel.
It’s worth noting that OpenAI, like its competitors, continually iterates on its models after they’re released. This one, like GPT-4 before it, will likely improve over time. But for now, GPT-5 feels like a powerful model built for other machines to talk to, not for humans looking for a conversational partner. That is probably why many people prefer GPT-4o, and why OpenAI had to backtrack on its decision to deprecate old models.
While it demonstrates remarkable proficiency in analytical and technical domains, excelling at complex tasks like coding, IT troubleshooting, logical reasoning, mathematical problem-solving, and scientific analysis, it feels limited in areas requiring distinctly human creativity, artistic intuition, and the subtle nuance that comes from lived experience.
GPT-5’s strength lies in structured, rule-based thinking where clear parameters exist, but it still struggles to match the spontaneous ingenuity, emotional depth, and creative leaps that are key in fields like storytelling, artistic expression, and imaginative problem-solving.
If you’re a developer who needs fast, accurate code generation, or a researcher requiring systematic logical analysis, then GPT-5 delivers real value. At a lower price point compared to Claude, it is actually a solid deal for specific professional use cases.
But for everyone else (creative writers, casual users, or anyone who valued ChatGPT for its personality and versatility), GPT-5 feels like a step backward. The context window handles a maximum of 128K tokens of output and 400K tokens in total, but compared against Gemini’s 1-2 million, or even the 10 million supported by Llama 4 Scout, the difference is noticeable.
Going from 128K to 400K tokens of context is a nice upgrade from OpenAI, and might be enough for most needs. However, for more specialized tasks like long-form writing or meticulous research that requires parsing huge amounts of data, this model may not be the best option, considering other models can handle more than twice that amount of information.
Users aren’t wrong to mourn the loss of GPT-4o, which managed to balance capability with character in a way that, at least for now, GPT-5 lacks.