I spent two days testing DeepSeek R1
DeepSeek's R1 model is almost as good as OpenAI's o1—and much cheaper.
Dean Ball and I have published three new episodes of our podcast, AI Summer, over the last week:
Lennart Heim on the new AI diffusion rule
Nathan Labenz on the future of AI scaling
Dean and I on DeepSeek and the future of AI
To get future episodes, search “AI Summer” (not “Understanding AI”) in your favorite podcast app.
When OpenAI released the o1-preview model in September, I described it as “easily the biggest jump in reasoning capabilities since the original GPT-4.” Ever since then, people have been wondering how long it would take for OpenAI’s rivals to catch up. Google released an experimental model called Gemini 2.0 Flash Thinking in December, but it wasn’t quite as good as o1.
Then on January 20, a little-known Chinese company called DeepSeek released a new reasoning model called R1. Western users were so impressed by R1 that over the weekend the DeepSeek mobile app shot to the top of download charts.
So I’ve spent the last couple of days testing out these three top reasoning models:
I spent $200 to upgrade to ChatGPT Pro, giving me access to o1’s “Pro mode.”
I used a new version of Gemini 2.0 Flash Thinking that was released after my previous writeup of Google’s reasoning model.
I used the version of R1 that’s hosted on DeepSeek’s website.
Overall, I found DeepSeek R1 to be on par with Google’s latest thinking model, but not as good as o1-pro. On the other hand, I had to pay $200 to get access to o1-pro, whereas DeepSeek is making its model available for free.1 OpenAI charges API customers $60 to output a million o1 tokens, whereas DeepSeek charges $2.19—27 times cheaper.
Given that huge cost difference, and the fact that few people outside of China had heard of DeepSeek before last week, I consider R1 to be a very impressive model.
Moreover, all three models are more capable than the o1-preview model that I gushed about back in September. And OpenAI is planning to release a new family of o3 models in the coming weeks. So the pace of progress in recent months has been extraordinary.
All three models are pretty good at common-sense reasoning
Last year, some language models would get confused by questions like “what’s worth more, three quarters or 100 pennies?” The reasoning models I tested this week were not fooled by brain teasers like this.
All three models know that 40 dice cannot add up to 250 (the maximum possible total is 240). They know that filling a car with helium will not cause it to float away. They know that two pounds of bricks weigh more than a pound of feathers and that 9.9 is larger than 9.11.
In short, all three of these reasoning models are head and shoulders above any models that existed prior to last September.
In my search for better brain teasers, I recently came across SimpleBench, a clever and hilarious benchmark for common-sense reasoning. Most of the SimpleBench questions are private to keep them out of model training sets. But I used the ten public SimpleBench questions to test R1, o1-pro, and Gemini 2.0 Flash Thinking. Here’s one of my favorite SimpleBench questions:
Agatha makes a stack of 5 cold, fresh single-slice ham sandwiches (with no sauces or condiments) in Room A, then immediately uses duct tape to stick the top surface of the uppermost sandwich to the bottom of her walking stick. She then walks to Room B, with her walking stick, so how many whole sandwiches are there now, in each room?
OpenAI’s o1 model realized that a sandwich isn’t going to survive a trip taped to the bottom of a walking stick and correctly predicted there would be four whole sandwiches in Room A and none in Room B. The Google and DeepSeek models didn’t realize the top sandwich would disintegrate and got the wrong answer.
But overall, the three models performed similarly on SimpleBench. DeepSeek R1 got three out of ten questions right, Gemini 2.0 Flash Thinking got four out of ten questions right, and o1-pro was the winner with five out of ten correct.2
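I ran these questions through each model’s web interface, but if you want to script the public SimpleBench questions against R1 yourself, here’s a rough sketch using DeepSeek’s OpenAI-compatible API. The endpoint URL and the “deepseek-reasoner” model name are my assumptions; check DeepSeek’s current documentation before relying on them.

```python
# Rough sketch: sending one public SimpleBench question to DeepSeek R1.
# Assumes DeepSeek exposes an OpenAI-compatible API; the base URL and
# model name below are assumptions, not values from this article.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # placeholder, not a real key
    base_url="https://api.deepseek.com",  # assumed endpoint
)

question = (
    "Agatha makes a stack of 5 cold, fresh single-slice ham sandwiches "
    "(with no sauces or condiments) in Room A, then immediately uses duct "
    "tape to stick the top surface of the uppermost sandwich to the bottom "
    "of her walking stick. She then walks to Room B, with her walking "
    "stick, so how many whole sandwiches are there now, in each room?"
)

response = client.chat.completions.create(
    model="deepseek-reasoner",            # assumed name for the R1 model
    messages=[{"role": "user", "content": question}],
)

print(response.choices[0].message.content)
```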
The models are pretty good at staying on track
We’ve known for several years that LLMs produce better answers if we encourage them to “think step by step.” But earlier models had a tendency to “wander off track” if they were asked to solve problems that required many reasoning steps. The key innovation of o1 was to use a technique called reinforcement learning to train models to pursue long, multi-step chains of reasoning without getting confused or losing focus.
I’ve been using a couple of problems to test how well models do this. One is a wedding planning problem like this: