Date: 2025-04-17

Each time a frontier lab released a major new model in 2024, I put it through its paces and wrote about the experience. But in the new year I have not kept up with this practice. This is partly because companies have released so many new models. But it’s also because benchmarking is getting more difficult.

In early 2024, I could trick leading LLMs with simple brain teasers like “what weighs more, a pound of bricks or two pounds of feathers?” Google’s Gemini 1.0 Ultra model, for example, claimed they weighed the same.

Today’s models are much smarter. They are rarely fooled by simple trick questions. Over the last year I’ve developed questions that fool newer models, but those questions have gotten increasingly baroque and disconnected from real-world use cases.

So I need to recalibrate my approach. Part of that will be talking to more people about how models perform in the real world. Another part will be writing about AI products (like Deep Research, Operator, and Claude Code) rather than stand-alone models.

Still, readers might appreciate a brief primer on the major model releases that have occurred in the two months since I wrote about o3-mini in February. OpenAI, Anthropic, Google, Meta, and xAI have all had major releases during this period. In this article I’ll briefly describe the strengths and weaknesses of each one.

Several new models from OpenAI

OpenAI executives presented the new o3 and o4-mini models in a Wednesday video. (OpenAI)

Ever since the release of ChatGPT in 2022, OpenAI has set the narrative in the chatbot market. OpenAI is to chatbots what Tesla is to electric cars and Apple is to smartphones. ChatGPT recently became the world’s most downloaded mobile app by some measures. And CEO Sam Altman recently hinted that ChatGPT could have as many as one billion weekly active users.

“There will be a lot of people with great models,” Altman said. “We will try to build the best product.”

As OpenAI works to cement its lead in the consumer chatbot market, it is also experimenting with multiple new models that push the state of the art in different directions: