Understanding AI
I spent two days testing DeepSeek R1

DeepSeek's R1 model is almost as good as OpenAI's o1—and much cheaper.

Timothy B. Lee's avatar
Timothy B. Lee
Jan 30, 2025

Dean Ball and I have published three new episodes of our podcast, AI Summer, over the last week:

  • Lennart Heim on the new AI diffusion rule

  • Nathan Labenz on the future of AI scaling

  • Dean and I on DeepSeek and the future of AI

To get future episodes, search “AI Summer” (not “Understanding AI”) in your favorite podcast app.


When OpenAI released the o1-preview model in September, I described it as “easily the biggest jump in reasoning capabilities since the original GPT-4.” Ever since then, people have been wondering how long it would take for OpenAI’s rivals to catch up. Google released an experimental model called Gemini 2.0 Flash Thinking in December, but it wasn’t quite as good as o1.

Then on January 20, a little-known Chinese company called DeepSeek released a new reasoning model called R1. Western users were so impressed by R1 that over the weekend the DeepSeek mobile app shot to the top of download charts.

So I’ve spent the last couple of days testing out these three top reasoning models:

  • I spent $200 to upgrade to ChatGPT Pro, giving me access to o1’s “Pro mode.”

  • I used a new version of Gemini 2.0 Flash Thinking that was released after my previous writeup of Google’s reasoning model.

  • I used the version of R1 that’s hosted on DeepSeek’s website.

Overall, I found DeepSeek R1 to be on par with Google’s latest thinking model, but not as good as o1-pro. On the other hand, I had to pay $200 to get access to o1-pro, whereas DeepSeek is making its model available for free.1 OpenAI charges API customers $60 to output a million o1 tokens, whereas DeepSeek charges $2.19—27 times cheaper.
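The arithmetic behind that price gap is easy to check. This is just a throwaway sketch using the per-million-output-token prices quoted above; actual API pricing may have changed since then.

```python
# Published per-million-output-token prices cited in the text (USD).
# Treat these as a snapshot, not current pricing.
O1_PRICE_PER_M = 60.00   # OpenAI o1 API, output tokens
R1_PRICE_PER_M = 2.19    # DeepSeek R1 API, output tokens

ratio = O1_PRICE_PER_M / R1_PRICE_PER_M
print(f"o1 costs {ratio:.1f}x as much as R1 per output token")
```

Rounding down gives the "27 times cheaper" figure in the text.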

Given that huge cost difference, and the fact that few people outside of China had heard of DeepSeek before last week, I consider R1 to be a very impressive model.

Moreover, all three models are more capable than the o1-preview model that I gushed about back in September. And OpenAI is planning to release a new family of o3 models in the coming weeks. So the pace of progress in recent months has been extraordinary.

OpenAI CEO Sam Altman has a serious new rival to worry about. (Photo by Mike Coppola/Getty Images)

All three models are pretty good at common-sense reasoning

Last year, some language models would get confused by questions like “What’s worth more, three quarters or 100 pennies?” The reasoning models I tested this week were not fooled by brain teasers like this.

All three models know that 40 dice cannot add up to 250. They know that filling a car with helium will not cause it to float away. They know that two pounds of bricks weigh more than a pound of feathers and that 9.9 is larger than 9.11.
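The arithmetic behind two of those teasers is worth spelling out. This is my own sanity check, not one of the prompts used in testing:

```python
# 40 standard six-sided dice max out at 40 * 6 = 240,
# so a total of 250 is impossible.
n_dice, faces = 40, 6
print(n_dice * faces)         # 240, which is less than 250

# Three quarters vs. 100 pennies, in cents:
print(3 * 25, "vs", 100 * 1)  # 75 vs 100 -- the pennies win

# And 9.9 really is larger than 9.11:
print(9.9 > 9.11)             # True
```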

In short, all three of these reasoning models are head and shoulders above any models that existed prior to last September.

In my search for better brain teasers, I recently came across SimpleBench, a clever and hilarious benchmark for common-sense reasoning. Most of the SimpleBench questions are private to keep them out of model training sets. But I used the ten public SimpleBench questions to test R1, o1-pro, and Gemini 2.0 Flash Thinking. Here’s one of my favorite SimpleBench questions:

Agatha makes a stack of 5 cold, fresh single-slice ham sandwiches (with no sauces or condiments) in Room A, then immediately uses duct tape to stick the top surface of the uppermost sandwich to the bottom of her walking stick. She then walks to Room B, with her walking stick, so how many whole sandwiches are there now, in each room?

OpenAI’s o1 model realized that a sandwich isn’t going to survive a trip taped to the bottom of a walking stick and correctly predicted there would be four whole sandwiches in Room A and none in Room B. The Google and DeepSeek models didn’t realize the top sandwich would disintegrate and got the wrong answer.

But overall, the three models performed similarly on SimpleBench. DeepSeek R1 got three out of ten questions right, Gemini 2.0 Flash Thinking got four out of ten questions right, and o1-pro was the winner with five out of ten correct.2

The models are pretty good at staying on track

We’ve known for several years that LLMs produce better answers if we encourage them to “think step by step.” But earlier models had a tendency to “wander off track” if they were asked to solve problems that required many reasoning steps. The key innovation of o1 was to use a technique called reinforcement learning to train models to pursue long, multi-step chains of reasoning without getting confused or losing focus.

I’ve been using a couple of problems to test how well models do this. One is a wedding planning problem like this:

Keep reading with a 7-day free trial

Subscribe to Understanding AI to keep reading this post and get 7 days of free access to the full post archives.

© 2025 Timothy B Lee