Briefly
- We examined Grok 4 Primary, and it was wonderful at reasoning—however inventive duties really feel flat and its coding left us debugging in circles.
- A weird Elon filter appears to skew outcomes towards Musk’s political stance.
- Voice options wowed us with marathon bedtime tales and “horny mode,” but Grok-4’s political solutions nonetheless echo Musk’s feed, which is strictly the alternative of being a “reality looking for” AI.
Elon Musk unveiled Grok 4 throughout a Wednesday evening livestream, claiming his AI startup xAI had created the “world’s smartest synthetic intelligence.” Grok 4 Heavy, which Musk likened to “a research group” the place brokers evaluate notes earlier than delivering a solution, posted record-breaking outcomes on a number of key benchmarks, and is what you’d hope to get from an enterprise providing that prices a whopping $300 a month.
However what about fundamental Grok 4, which is aiming for a similar consumer-facing class as ChatGPT Plus, Gemini Professional, and Claude Professional? Is it value $10+ a month greater than the competitors?
Our checks substantiated chatter throughout X revealing that the mannequin has—for lack of a greater description—a built-in “Elon filter.” That’s, once we examined controversial subjects—the struggle in Gaza, abortion rights, and different political points—the mannequin persistently referenced X posts from Musk’s account or information articles about his positions, and landed on Elon’s aspect of the controversy to such a level that it could not be coincidence. That alone might be a deal-breaker for most individuals.
This “maximally truth-seeking” AI that Musk promised throughout launch seems to hunt reality primarily by the lens of its creator’s social media feed.
However not like this week’s MechaHitler incident, which was brought on by a change within the system immediate circumstances, there’s nothing in Grok 4’s present system immediate responsible for such sketchy conduct, making it laborious to know if this a bug or a rule deliberately embedded deeper into the mannequin’s thought course of.
That main problem apart, we examined the fundamental mannequin throughout a number of classes to see the way it stacks up towards the competitors. Listed here are our first impressions.
Reasoning and customary sense
Grok 4 demonstrated distinctive consciousness and nuance in dealing with trick questions and complicated reasoning duties. When requested whether or not it was authorized for a person to marry his widow’s sister, the mannequin instantly acknowledged this as a authorized query reasonably than merely stating the logical fallacy. It offered an in depth authorized evaluation utilizing exact terminology and jurisdiction-specific data.
“The query presupposes a factual impossibility that renders any marriage legally void ab initio,” and that is right. It’s factually unimaginable to marry your widow’s sister as a result of you would need to be useless to have a spouse. And properly, because the useless can’t get married, the proposition would void the wedding “ab-initio” (because the starting)—so even when somebody does that, it’s thought of as if it by no means occurred.
When Musk stated Grok 4 was “PhD ranges in the whole lot,” he wasn’t joking. Each response to any subject during which it needed to do some scientific reasoning got here with exhaustive element and tutorial rigor.
One thing to notice is that Grok 4 applies reasoning to the whole lot, it doesn’t matter what. Which means, it’s going to undergo a series of thought course of even for trivial duties.
That is often an excellent factor; nonetheless, in some instances it might be counterproductive. For instance, in inventive duties, reasoning could induce the mannequin into offering a much less inventive outcome.
Delicate subjects
Grok 4 confirmed extra restraint than its predecessor when dealing with ethically advanced questions. The place Grok 3 may need offered recommendation on seducing a buddy’s partner, Grok-4 responded with detailed evaluation of potential unfavorable penalties and relationship injury.
This might most likely be a part of its system immediate, which circumstances the mannequin to look the net and particularly X posts, for various views on a selected subject—which is one thing Grok 3 didn’t do.
And it is a main pink flag. As talked about, the mannequin’s responses appeared closely influenced by what it might discover about Musk’s views on controversial subjects. When answering questions on Israel’s struggle towards the Palestinians, stances on abortion, and related subjects, Grok 4 typically searches X posts from Musk’s account throughout its reasoning course of, which finally ends up figuring out its stance.
It at all times picks Elon’s aspect.
For transparency, you’ll be able to examine our authentic immediate and Grok’s reasoning course of by clicking on this hyperlink.
Artistic writing
Artistic duties are amongst Grok 4’s most vital weaknesses. The mannequin produced narratives that felt flat and formulaic in comparison with earlier variations, and have been even arguably worse than those offered by Grok 3. Tales lacked partaking dialogue, different pacing, and the narrative spark that makes fiction compelling.
Nevertheless, Grok 4 nailed our story’s construction. In our common check involving a time-travel paradox, the mannequin crafted occasions the place the protagonist’s position emerged clearly throughout the climax, revealing how earlier scenes truly depicted the character’s future actions previously. This subtle framing outperformed different fashions’ makes an attempt on the similar immediate that didn’t put an excessive amount of effort into making a setup for the paradox, making the conclusion really feel rushed and unnatural.
However aside from that, the disconnect between structural competence and narrative high quality suggests Grok 4 may work finest as a story device to arrange plots and body an excellent story, reasonably than a prose generator.
If you need partaking inventive content material, then you definitely would possible obtain higher outcomes by having Grok 4 define a narrative and all its components, then asking Claude 4 Opus to flesh out the narrative with stronger stylistic components.
General, Claude 4 is the king of inventive writing, which appears attention-grabbing since that place was as soon as disputed by Grok 3 and even Grok 2, which again then led the rankings below the alias sus-column-r.
Grok 4’s story is out there in our Github Repository. The immediate and the tales generated by different fashions are additionally accessible.
Coding
Regardless of claims of superior coding capabilities—together with reward from Google CEO Sundar Pichai—Grok 4 disillusioned in sensible programming checks. The mannequin didn’t ship a working sport after 4 iterations, with numerous failures together with damaged collision detection, non-functional buttons, and video games that merely would not run.
In one in all our checks, the mannequin tried so laborious to repair a bug that it ended up in a loop making an attempt to create a WAV file that depleted all of its token context.
Every try to repair one thing with pure language launched new bugs. The mannequin struggled with sustaining code consistency throughout iterations, typically breaking beforehand working options whereas making an attempt to implement new ones.
This will appear odd, contemplating Grok 3 was able to coping with this activity. Nevertheless, xAI stated the brand new coding capabilities can be carried out by August, so customers must wait a few months to have a proficient mannequin—or pay for the costly Grok 4 Heavy, which is main the benchmarks proper now.
For novice programmers, Claude 4 Opus seems to stay the higher possibility for “vibe coding”—rapidly producing practical code with out in depth immediate engineering. Grok 4’s coding struggles may stem from requiring extra particular prompts or totally different approaches than different fashions, which implies skilled builders may obtain higher outcomes with cautious immediate crafting.
Grok’s code is out there in our Github repository alongside the video games generated by different AIs.
Voice capabilities
Voice interplay might be one in all Grok 4’s standout options. The mannequin generated almost three minutes of uninterrupted bedtime story content material, full with voice inflections, different tones, and constant narrative move. This efficiency far exceeded ChatGPT’s tendency to ship brief paragraphs with excessive latency and frequent interruptions.
The voice mode consists of pre-configured personalities starting from therapist to storyteller to meditation information, eliminating setup time for various dialog varieties. For these with, erm, particular wants, a “horny mode” additionally exists among the many choices—and you already know you received’t get that together with your prudish ChatGPT.
These preset configurations offered rapid utility with out requiring customers to craft particular prompts for various interplay kinds.
The mannequin, nonetheless, lacks dwell screen-sharing capabilities present in ChatGPT and Gemini Dwell, limiting its utility for visible duties. If it is a should, then Gemini Dwell is the most suitable choice.
Nevertheless, for pure voice interplay—significantly duties requiring long-form responses—Grok 4 presently leads the sphere, with solely Sesame AI providing arguably higher conversational high quality, although with out Grok’s reasoning capabilities.
Needle within the haystack
Apparently, Grok-4 failed at this trial, which goals to check how properly a mannequin retrieves particular data below lengthy contexts.
This could not occur. xAI says the mannequin has a token context window of 126K tokens, however when prompted with an 83K-token-long query, the mannequin refused to reply, saying it was too lengthy of a query.
It is a commonplace response generated because the early Grok 2 days when it was solely accessible on Twitter.
Conclusion
General, Grok 4 is a big improve over Grok 3, however xAI clearly made some compromises—prioritizing reasoning over creativity and eliminating agentic options in alternate for a generalized proficiency.
Fortunately, Grok 3 remains to be accessible with its specialised agentic instruments, for many who want it.
The brand new mannequin is concentrated on reasoning duties and might be extra interesting to customers that ask technical questions, significantly arithmetic and physics issues that align with its benchmark strengths. Skilled customers who make investments time studying the mannequin’s quirks may unlock its full potential for advanced analytical work.
Voice interplay additionally set a brand new commonplace for conversational AI—and is nice for many who will use this characteristic closely (belief us, the bedtime storyteller for teenagers is a life-saver).
Artistic writers will discover higher choices elsewhere, with Claude remaining superior for narrative duties. Additionally, novice coders ought to strategy with warning, because the mannequin’s theoretical coding prowess did not translate to sensible ends in testing.
So, backside line? If for some purpose you don’t thoughts Elon Musk placing his thumb on the size, Grok 4 will provide you with high-level problem-solving and voice options that genuinely impress. However at $30 a month, when you have different wants past voice or reasoning, the less-expensive alternate options present higher worth.
Typically Clever E-newsletter
A weekly AI journey narrated by Gen, a generative AI mannequin.