Elon Musk Said Grok 4 Was the “Smartest AI in the World,” But Its Leaderboard Scores Just Came Out and They Tell a Different Story

Elon Musk has been boasting about what he says are the incredible capabilities of xAI’s new Grok 4 AI chatbot.
“Grok 4 is smarter than almost all graduate students in all disciplines, simultaneously,” Musk bragged, adding that Grok 4 was “the smartest AI in the world.”
Is it really? Intelligence was a hard thing to measure even before AI hit the scene, but certain tests can provide something of a clue.
One prominent platform for doing so is the UC Berkeley-developed LMArena leaderboard, which crowdsources rankings of AI models by having users score their responses in categories ranging from creative writing and coding to math and vision.
In its latest scores, Grok 4 ranked third overall and in text generation. Make no mistake, that’s impressive, but it still trails advanced models from Google and OpenAI. (Specifically, Google’s Gemini 2.5 placed first and OpenAI’s o3 and 4o models tied for second, with GPT-4.5 tied with Grok 4 for third.)
While Grok is clearly a fearsome competitor in the arenas of racism and antisemitism, in other words, even its latest release clearly falls short of being the “smartest AI in the world.” (This isn’t entirely surprising; Musk has a long history of fibbing in his professional life, political activities, and even his hobbies.)
Perhaps the only saving grace for Grok is the suggestion, per expert criticism, that Berkeley’s chatbot arena may be more vibes-based than strictly scientific.
According to a recent study conducted by a consortium of AI researchers and led by the machine learning firm Cohere, the leaderboard suffers from “systematic issues that have resulted in a distorted playing field.” Among the serious allegations raised by the researchers is the claim that the arena conducts “undisclosed private testing” before publicly releasing scores, and that rankings can be retracted at will.
Soon after the paper’s release, it was revealed that the version of Meta’s Llama 4 used by the leaderboard wasn’t the same one that had been released publicly, a bait-and-switch ploy on Meta’s part to charm the human voters behind the arena.
Though an apology was issued and Meta was thrown under the bus for its sketchy attempts to rig the game, it was still a really bad look that marred the chatbot arena’s credibility. What that means for Grok, though? We’ll have to ask the smartest AI in the world.