Casey Newton, writing at his Platformer newsletter:

As I write this, a Meta model named Llama-4-Maverick-03-26-Experimental indeed has a score of 1417 on LMArena, which is enough to put it at second place — just behind Google’s highly regarded Gemini Pro 2.5 model, and just ahead of ChatGPT 4o. It’s an impressive showing that lends credence to CEO Mark Zuckerberg’s core belief in more open development, which is that it can improve upon the performance of closed models by crowdsourcing its development from many more contributors. And it’s no wonder the company promoted it in its announcement materials.

Within a day, though, observers were pointing out that there is something misleading about Meta’s announcement. Namely, the version of Maverick that nearly topped LMArena isn’t the version you can download — rather, it’s a custom version of Llama that Meta seemingly developed with the express purpose of excelling at LMArena…

Meta, for its part, denies the “teaching to the test” allegations.

“We’ve also heard claims that we trained on test sets – that’s simply not true and we would never do that,” said Ahmad Al-Dahle, who leads generative AI at Meta, in a post on X. “Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.”

I don’t know what it means to “stabilize an implementation,” or how it might relate to any of the above. When I asked Meta for further explanation, it suggested that its experimental version of Llama 4 just happened to be really good at LMArena, and was not expressly designed for that purpose.

Meta is clearly lying, and its statement is hands-caught-in-the-cookie-jar-level embarrassing. I mean this genuinely: I blurted out laughing at Newton writing that Meta suggested the experimental Llama 4 model was just “really good” at LMArena. Al-Dahle claims that the specialized version of Llama wasn’t trained on test sets, which I’m sure is true, but that entirely ignores the possibility that the “experimental” Llama model could’ve been trained to be better at LMArena. This particular line really stood out to me in Meta’s comment to Platformer: “We’re excited to see what they will build and look forward to their ongoing feedback.”

Sounds like something Karoline Leavitt, the White House press secretary, would say. I can’t emphasize enough how bad Meta is at public relations — it wants to be treated with respect so badly, yet it resorts to silly marketing gimmicks like proactively reaching out to journalists to slander a book it so desperately wants out of circulation, or outfitting Zuckerberg with a new hairstyle and bronzer to appeal to the Make America Great Again squad of broccoli-cut Generation Z boys. What a series of unforced errors: It’s already bad enough to create a fake large language model to look good on benchmarks that most normal people don’t even care about, but it’s even worse to put out a hysterically bad statement when confronted about it by a journalist with a knack for this kind of tomfoolery.

Either way, the “experimental” Llama 4 Maverick model remains on LMArena’s leaderboard just below Gemini 2.5 Pro. But this leaderboard, in general, is fascinating to me, and I’ve been meaning to write about it for a while. (Thanks, Meta, for providing a convenient time for me to do so.) In the overall rankings, Grok 3 beats DeepSeek R1, which threw the generative artificial intelligence grifters of Silicon Valley into a frenzy in the hopes it would spark a war with China. But even Google’s open-source Gemma model beats Anthropic’s finest reasoning model, Claude 3.7 Sonnet, which I find to be one of the most intelligent models out there. Even GPT-4.5, which OpenAI claims isn’t smarter than GPT-4o, does better than Claude.

In coding performance, the fake version of Llama 4 Maverick takes the lead, but o3-mini-high — OpenAI’s fanciest reasoning model, which it touts as “great for coding and logic” — underperforms vanilla GPT-4o by 61 points. OpenAI is so proud of o3-mini-high that it incessantly upsells people who use GPT-4o for programming questions to switch to the higher-end model, which has tight usage limits. But judging from the benchmark, people don’t prefer it over the standard model; they think GPT-4o’s responses are markedly better. The whole thing seems suspicious to me.

This is because LMArena is practically useless, which makes Meta’s little game of deception even more embarrassing. The benchmark allows users — mainly nerds who have nothing better to do than play with LLMs all day, and I say this as a nerd who loves toying with LMArena — to enter prompts, then compare the responses from two randomly selected models in a side-by-side blind competition. They pick which response they like better before the names are revealed, and the more often users prefer a model’s responses, the higher it climbs in the rankings. The problem is that people don’t necessarily evaluate the models for thoroughness or accuracy in these tests — they’re more focused on how the model answers the question. That’s not necessarily a bad thing, but it’s far from a well-rounded evaluation.
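
To make the mechanics concrete, here’s a rough sketch of how a pairwise-preference leaderboard like this can turn blind votes into a ranking, using a simple Elo-style update. It’s only an illustration: my understanding is that LMArena fits a Bradley-Terry-style statistical model to the full set of votes rather than updating scores one battle at a time, and the vote data, starting rating, and K value below are made up for the example.

```python
# Illustrative Elo-style scoring for blind, pairwise model "battles".
# A simplified sketch of the idea behind preference leaderboards,
# not LMArena's actual ranking method.

K = 32  # update step size (arbitrary, chess-style default)


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def record_vote(ratings: dict[str, float], winner: str, loser: str) -> None:
    """Apply one blind vote: the preferred model gains rating, the other loses it."""
    ratings.setdefault(winner, 1000.0)
    ratings.setdefault(loser, 1000.0)
    delta = K * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Made-up votes: (model the user preferred, model it was paired against)
votes = [
    ("gemini-2.5-pro", "gpt-4o"),
    ("llama-4-maverick-experimental", "claude-3.7-sonnet"),
    ("gpt-4o", "o3-mini-high"),
    ("gemini-2.5-pro", "claude-3.7-sonnet"),
]

ratings: dict[str, float] = {}
for winner, loser in votes:
    record_vote(ratings, winner, loser)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

The point of the sketch is simply that the ranking rewards whichever answer voters happen to like in the moment, not whichever is more accurate or thorough.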

GPT-4o is really nice to talk to — especially the latest version released in late March. It asks questions back, speaks less robotically, and has a palpable sense of emotion in its responses. When it works through a complicated problem, it explains things like a teacher rather than a robot and is generally quite pleasant in its word choice and demeanor. The more advanced o3 models, however, are colder in their answers. They often get straight to the point, use too many bullet points and ordered lists, are reluctant to explain their thoughts outside of the chains of thought (which are condescending and sometimes even rude), and aren’t conversational in the slightest. What separates OpenAI’s reasoning models from Gemini 2.5 Pro is how they speak. While OpenAI’s reasoning models would probably score quite low on an emotional quotient test, Gemini tries to sound friendly and thorough. That explains the LMArena score.

I don’t think Gemini 2.5 Pro is the smartest reasoning model. I’d probably hand that award to either o3-mini-high or Claude 3.7 Sonnet, which falls behind considerably in the explanation department. But of the three, I generally prefer Claude’s answers when my question doesn’t require a large context window (Gemini) or real-time web search (ChatGPT). Its responses are neatly formatted and never confusing to read. Gemini prefers long paragraphs in my experience, while ChatGPT is way too reliant on nested lists and headers. Claude speaks in bullet points, too, but they actually make sense and are easy to skim, while ChatGPT’s are all over the place, using numbered lists, bullet points, and paragraphs of text all under one heading. If there’s anything I hate about ChatGPT, it’s how it formats its responses.

All of this is to say I can see why Gemini and Llama 4 Maverick — some of the chattiest, friendliest models — take the top spots on LMArena while the smarter models fall behind. I take these benchmarks with a grain of salt and usually recommend models depending on what I think they’re best at:

  • GPT-4o: Everyday use with real-time knowledge and decent coding and writing skills.
  • Claude 3.7 Sonnet: Math and coding, especially when straightforward answers are the goal.
  • o3-mini: ChatGPT but less chatty and better at programming and logic.
  • Gemini: Exceptional in situations when a large context window is needed.
  • Llama 4: Great for interrupting your Instagram scrolling experience.