Google's new "thinking" model isn't quite as good as o1
But the small, fast model might allow Google to undercut OpenAI on price.
Normal companies avoid announcing new products in the final days before Christmas, but OpenAI and Google DeepMind are not normal companies. Both made significant announcements in December; I plan to write about some of them over the next couple of weeks. Let’s start with one of the month’s most underrated announcements: the December 19 release of Gemini 2.0 Flash Thinking Experimental.
That name requires unpacking:
Gemini 2.0 is the new family of models Google announced on December 11.
Flash suggests this is the smallest member of that family—though so far no others have been announced.
Thinking means it behaves like OpenAI’s o1: it generates “thinking tokens” before producing a response.
Experimental means it’s a preview for developers, not a commercial product.
OpenAI chose to name its first thinking model o1 (rather than, say, GPT-4.5) to signal that this was a new kind of model. Despite this, it’s widely believed that o1 is a variant of a conventional LLM like GPT-4o.
Google isn’t being so coy about this. Calling the model “Gemini 2.0 Flash Thinking” clearly signals that it’s a modification of the vanilla Flash model.
In my last post, I predicted an industry-wide pivot away from conventional scaling focused on ever-larger models. I said other leading labs were likely to follow in OpenAI’s footsteps, using a technique called reinforcement learning to train models that produce longer and more accurate chains of thought. It seems Google is doing exactly that.
But unlike OpenAI, Google is not hiding the “thinking tokens” generated by its new model. So we can look under the hood to better understand its reasoning process.
So how well does Google’s latest model perform?
When I reviewed o1 in September, I wrote that it was “easily the biggest jump in reasoning capabilities since the original GPT-4.” Google’s new model is not quite on the level of o1, but it’s pretty close. It’s head and shoulders above any of the non-“thinking” LLMs I’ve tried.
Google’s model doesn’t get tripped up by any of the brain teasers that frequently confuse GPT-4-class models. For example, Gemini 2.0 Flash Thinking knows that 100 pennies are worth more than three quarters, that 9.9 is more than 9.11, that filling a car with helium won’t cause it to float away, and that 20 pairs of six-sided dice cannot add up to 250 or more.
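That last brain teaser comes down to simple arithmetic, which is exactly why it trips up models that pattern-match instead of reasoning. A quick sketch (my own, not from the article) of the calculation a model needs to get right:

```python
# Sanity-check the dice brain teaser: 20 pairs of six-sided dice
# means 40 dice in total, each showing at most 6 pips.
num_pairs = 20
num_dice = num_pairs * 2       # 40 dice
max_total = num_dice * 6       # highest possible sum across all dice
print(max_total)               # 240, which falls short of 250
```

A model that confuses "20 pairs" with "20 dice" (a maximum of 120) or with 250 dice gets the wrong answer; the correct ceiling is 240.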
The launch of o1 in September dramatically raised the bar for testing LLMs, forcing me to come up with some new benchmarks. In this article I’ll apply some of the same benchmarks to Google’s new model. As we’ll see, Google has not quite caught up to the performance of OpenAI’s o1 models, to say nothing of the o3 models that are due out early next year.
But Google isn’t that far behind either. And given that this is a Flash model—the brand Google typically uses for the smallest and cheapest version of a model—I expect Google to release more powerful models in the coming months.
Google hasn’t announced pricing for the Gemini 2.0 family of models, but Gemini 1.5 Flash costs 7.5 cents per million input tokens and 30 cents per million output tokens. OpenAI’s cheapest model, GPT-4o mini, costs twice as much. And OpenAI’s cheapest “thinking” model, o1 mini, costs 40 times as much.
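Those multiples follow directly from the per-token prices. Here is the arithmetic, using the Gemini 1.5 Flash prices cited above and OpenAI prices implied by the article's "twice as much" and "40 times as much" figures (treat the exact OpenAI numbers as illustrative):

```python
# Prices in dollars per million tokens.
# Gemini 1.5 Flash prices are from the article; the OpenAI prices
# are back-calculated from the article's stated multiples.
gemini_flash = {"input": 0.075, "output": 0.30}
gpt_4o_mini = {"input": 0.15, "output": 0.60}   # OpenAI's cheapest model
o1_mini = {"input": 3.00, "output": 12.00}      # OpenAI's cheapest "thinking" model

for name, prices in [("GPT-4o mini", gpt_4o_mini), ("o1 mini", o1_mini)]:
    ratio = prices["input"] / gemini_flash["input"]
    print(f"{name} input tokens cost {ratio:.0f}x Gemini 1.5 Flash")
# GPT-4o mini input tokens cost 2x Gemini 1.5 Flash
# o1 mini input tokens cost 40x Gemini 1.5 Flash
```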
I assume Google will charge a premium for its “thinking” model just as OpenAI does. Still, there could be a lot of room for Google to beat OpenAI on price. So even though Google’s new thinking model isn’t quite comparable to o1 in reasoning performance, I see it as a very significant release.