Anthropic, just before Thanksgiving:

Our newest model, Claude Opus 4.5, is available today. It’s intelligent, efficient, and the best model in the world for coding, agents, and computer use. It’s also meaningfully better at everyday tasks like deep research and working with slides and spreadsheets. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done.

Claude Opus 4.5 is state-of-the-art on tests of real-world software engineering… As our Anthropic colleagues tested the model before release, we heard remarkably consistent feedback. Testers noted that Claude Opus 4.5 handles ambiguity and reasons about tradeoffs without hand-holding. They told us that, when pointed at a complex, multi-system bug, Opus 4.5 figures out the fix. They said that tasks that were near-impossible for Sonnet 4.5 just a few weeks ago are now within reach. Overall, our testers told us that Opus 4.5 just “gets it.”

Claude Opus 4.5 is easily the best large language model on the market for myriad reasons: It doesn’t speak corporate English, it’s phenomenal at programming, and it’s excellent at explanations. It is the only model that speaks remotely like a human being, and it is the only model that can write safe, efficient, and uncomplicated code. It appears my assertion that Gemini 3 Pro would be the “smartest model for the next 10 weeks” was a bit ambitious. But none of this surprises me: Anthropic’s models have a certain quality to them that makes them feel so nice to interact with. They’re candid, don’t try to be too clever, and push back when needed. I can’t believe there was a time when I discounted Anthropic’s competence.

Claude Opus 4.5 is, by all of the benchmarks, the smartest model for coding. LMArena, a website that asks people to blindly rank model responses, has it at No. 1 on the web development leaderboard, and it excels in all the benchmarks that Gemini 3 Pro owned just earlier in November. But I wouldn’t say its coding performance is that much better quantitatively than that of Claude Sonnet 4.5 or any of its competitors. If one gives a question to GPT-5.1 Codex Max, Gemini 3 Pro, Claude Sonnet 4.5, and Claude Opus 4.5, they’ll all conjure up more or less the same solution. The difference comes in how that solution is presented: Gemini is more verbose and messy, GPT-5.1 is terse yet overcomplicates implementations, and Claude strikes a balance. Simon Willison describes this phenomenon well:

It’s clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in the milestone for the alpha. I switched back to Claude Sonnet 4.5 and… kept on working at the same pace I’d been achieving with the new model.

With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected.

I’m not saying the new model isn’t an improvement on Sonnet 4.5—but I can’t say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two.

This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn’t possible before. In the past these have felt a lot more obvious, but today it’s often very difficult to find concrete examples that differentiate the new generation of models from their predecessors.

Willison’s observation is an astute one, and I agree with it. I haven’t been able to try Claude Opus 4.5 in Claude Code — my preferred way of using artificial intelligence to write code, since it lets me abandon the hell of Visual Studio Code — because I only subscribe to Anthropic’s Claude Pro plan, not Claude Max. Still, I have yet to encounter a problem Claude Sonnet 4.5 couldn’t solve. Sometimes it has required extra guidance, a few examples, or a bit of backtracking, but it has always gotten the job done. Perhaps Claude Opus 4.5 would format those responses better so I wouldn’t have to do any manual refactoring after the fact, or maybe it could accomplish the same thing with a less detailed prompt. But those aren’t reasons for me to spend five times the money on an AI chatbot.

Again, I maintain that Anthropic’s models are empirically the best on the market. Whatever Anthropic’s engineers are up to, they’re amazing at post-training LLMs. Claude’s personality is best in class, its code is remarkably professional, and the models follow instructions well. OpenAI’s models, by contrast, are trained to be great consumer-grade busywork assistants. When you ask for feedback on writing, GPT-5.1 will just rewrite the text in the most insufferable corporate tone anyone has ever heard. “I really hear you — and I can help,” it emphasizes. Gemini will do the same, more emphatically but with uninspiring diction. Claude does not rewrite; it tells you what is wrong with what you wrote. I don’t use LLMs for writing advice because I can confidently say I’m a better writer than any of these robots, but this is a common benchmark I use to test model personality.

Anthropic is, to put it lightheartedly, just built differently. Of course Claude Opus 4.5 is the best model on the market — Anthropic is the only AI lab left with taste.