Shares of Nvidia fell nearly 17 percent on Monday, wiping almost $600 billion off the chipmaker’s market capitalization. Nvidia rival AMD fell 6 percent, while TSMC—the Taiwanese company that manufactures Nvidia’s chips—fell 12 percent.
The selloff was widely attributed to the growing popularity of AI models from the Chinese company DeepSeek—especially R1, an open-weight competitor to OpenAI’s “reasoning” model o1.
But this explanation doesn’t make sense to me. DeepSeek’s models were trained using Nvidia chips, so it’s not obvious why DeepSeek’s success would be bad news for Nvidia. And it’s even harder to explain why it took a week for Wall Street to react to the January 20 release of R1.
A more plausible explanation is that someone tipped traders off to Donald Trump’s plans to slap tariffs on chips made in Taiwan—which Trump announced later in the day. I can’t prove this theory, but I think it fits the facts better than the DeepSeek theory. Interestingly, we didn’t see a second selloff in Nvidia or TSMC shares after Trump’s announcement, suggesting that markets had already “priced in” the news.

But correctly or not, Nvidia’s plunging stock price was widely attributed to DeepSeek, driving a surge of public interest in the company. So let’s talk about DeepSeek.
DeepSeek has been getting buzz among AI watchers for more than a year. When I did a roundup of smaller LLMs last May, I mentioned DeepSeek as one of three Chinese AI companies to watch. I noted that the DeepSeek-V2 model had the highest MMLU score of any Chinese model.
The release of V3 in December and then R1 in January cemented DeepSeek’s status as China’s leading AI lab. These models aren’t just among the best Chinese models; they’re arguably two of the best models in the world. Last week, DeepSeek CEO Liang Wenfeng was invited to speak at a symposium hosted by Chinese premier Li Qiang.
DeepSeek’s success has sparked a debate over the Biden administration’s export controls, which were designed to prevent Chinese companies from building leading AI models. Defenders of the export controls argue it will simply take a few years for the measures to have their full impact. It’s not a crazy argument, but I do think DeepSeek’s success should make policymakers think harder about what the export control regime is trying to accomplish.
V3: A stunningly cheap and powerful model
When DeepSeek released V3 in December, the model had a plausible claim to be the best-performing open model in the world. With 671 billion parameters, it was competitive with proprietary models like GPT-4o and Claude 3.5 Sonnet on a number of benchmarks—especially in math and coding. DeepSeek reported that V3 beat GPT-4o and Claude 3.5 Sonnet on problems from the challenging AIME math competition.
But perhaps the most remarkable thing about DeepSeek-V3 was its price tag. DeepSeek said it had trained V3 for about two months on a cluster of 2,048 GPUs, for a total of 2.8 million GPU-hours. Assuming it costs $2 per hour to rent a GPU, that works out to around $5.6 million in training costs.
For comparison, Meta trained the Llama 3 models on a cluster of 16,000 GPUs. Industry insiders estimate it took around 30 million GPU-hours to train the largest Llama 3.1 model, which has 405 billion parameters. In other words, training Llama required about 10 times as much computing power as training V3. And it’s not obvious that Llama is a better model than V3.
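For readers who want to check the arithmetic, here is the back-of-the-envelope calculation in Python. The $2-per-GPU-hour rental rate is the same assumption made above, and the Llama figure is the industry estimate just cited:

```python
# Back-of-the-envelope training cost comparison, using the figures cited above.
# The $2-per-GPU-hour rental rate is the same assumption made in the text.
GPU_HOUR_PRICE = 2.00             # dollars per GPU-hour (assumed rental rate)

deepseek_v3_gpu_hours = 2.8e6     # DeepSeek's reported total for V3
llama_405b_gpu_hours = 30e6       # industry estimate for Llama 3.1 405B

v3_cost = deepseek_v3_gpu_hours * GPU_HOUR_PRICE
llama_cost = llama_405b_gpu_hours * GPU_HOUR_PRICE

print(f"DeepSeek-V3: ~${v3_cost / 1e6:.1f} million")       # ~$5.6 million
print(f"Llama 3.1 405B: ~${llama_cost / 1e6:.0f} million")  # ~$60 million
print(f"Compute ratio: ~{llama_405b_gpu_hours / deepseek_v3_gpu_hours:.1f}x")  # ~10.7x
```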
DeepSeek released a technical report explaining exactly how V3 was trained. There was no single breakthrough that enabled DeepSeek to do more with less. Rather, DeepSeek pioneered a number of techniques to squeeze more performance out of its GPUs:
A technique called “mixture of experts” divides a neural network up into a bunch of smaller networks (called experts) that specialize in different reasoning tasks. At inference time, the model automatically figures out which experts are best suited to predict any given token. Only a small fraction of the network’s 671 billion parameters are used for any given token. This reduces the amount of computation required to train or use the model. DeepSeek didn’t invent this architecture, but it made several tweaks to improve its performance.
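To make the routing idea concrete, here is a minimal sketch of a mixture-of-experts layer in PyTorch. It is illustrative only: DeepSeek’s actual design adds shared experts, a load-balancing scheme, and other refinements described in its technical report. But it shows the key property that each token activates only a few of the experts.

```python
# Minimal mixture-of-experts sketch (illustrative, not DeepSeek's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores each expert for each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                               # x: (n_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only the best-scoring experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out   # each token only touched top_k of the n_experts networks

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)   # torch.Size([10, 64])
```

In V3 itself, this kind of routing means only about 37 billion of the 671 billion parameters are active for any given token.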
An important bottleneck for scaling LLMs to long contexts is the need to store vectors (known as key and value vectors—see my LLM explainer for full details) for each token in a model’s context window. DeepSeek developed a technique called Multi-head Latent Attention that stores these vectors in a compressed form so that they take up less memory and can be transferred between chips more efficiently.
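Here is a simplified sketch of that compression idea. It is not DeepSeek’s exact formulation (the real version also has to handle positional embeddings differently and can fold these projections into other matrices), but it shows the basic trick: cache one small latent vector per token instead of full keys and values, then expand it on the fly when attention is computed.

```python
# Simplified sketch of the latent-compression idea behind Multi-head Latent
# Attention: cache a small latent vector per token, rebuild keys/values on the fly.
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 1024, 128, 16, 64

down_proj = nn.Linear(d_model, d_latent)              # compress before caching
up_proj_k = nn.Linear(d_latent, n_heads * d_head)     # rebuild keys at attention time
up_proj_v = nn.Linear(d_latent, n_heads * d_head)     # rebuild values at attention time

hidden = torch.randn(1, 4096, d_model)                # one sequence of 4,096 tokens
latents = down_proj(hidden)                           # this is all we need to cache

naive_cache_size = 2 * 4096 * n_heads * d_head        # full keys + values per token
latent_cache_size = latents.numel()
print(f"KV cache is ~{naive_cache_size / latent_cache_size:.0f}x smaller")  # ~16x here

# When the model attends over the context, keys and values are reconstructed:
k = up_proj_k(latents).view(1, 4096, n_heads, d_head)
v = up_proj_v(latents).view(1, 4096, n_heads, d_head)
```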
GPUs conventionally represent numbers with 16 or 32 bits. However, machine learning experts have found that it works well enough to represent some values using just eight bits—a technique known as quantization. Squeezing model weights down from 16 bits to eight bits allows twice as many parameters to be stored in a GPU’s memory. This ultimately allowed DeepSeek to accomplish more with fewer GPUs.
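Here is a toy illustration of that memory-for-precision trade-off. Note that DeepSeek’s report describes training with an 8-bit floating-point format (FP8) with fine-grained scaling, which is more sophisticated than the simple integer scheme sketched below, but the basic effect (half the memory of 16-bit values, at the cost of some precision) is the same.

```python
# Toy illustration of 8-bit quantization: squeeze 16-bit weights into 8 bits.
# (DeepSeek actually uses FP8, a floating-point format; this integer scheme
# just demonstrates the memory-for-precision trade-off.)
import torch

weights = torch.randn(1024, 1024).to(torch.float16)        # a 16-bit weight matrix

scale = weights.abs().max() / 127                          # map the observed range onto int8
q_weights = torch.clamp((weights / scale).round(), -127, 127).to(torch.int8)
dequantized = q_weights.to(torch.float16) * scale          # approximate reconstruction

print(f"fp16 size: {weights.numel() * 2 / 1e6:.1f} MB")    # ~2.1 MB
print(f"int8 size: {q_weights.numel() / 1e6:.1f} MB")      # ~1.0 MB, half the memory
print(f"max rounding error: {(weights - dequantized).abs().max().item():.4f}")
```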
There were a number of even more technical innovations that I won’t try to explain in detail. Often training clusters are limited more by bandwidth—the capacity to move information between GPUs or between different areas inside a GPU—than by raw computing power. The DeepSeek team was able to significantly improve training performance by paying close attention to how their 2,048 GPUs were connected to one another and how they sent information back and forth.
R1: An open reasoning model
The December release of the V3 model got its share of attention, but it wasn’t seen as a major bombshell. Over the last 18 months, a number of companies have released open-weight models whose performance is impressive but not quite on the frontier. At the time of its release, DeepSeek-V3 didn’t seem that different from open-weight models from Mistral, Meta, and other companies that had been released earlier in the year.
Then on January 20, DeepSeek released R1, a variant of V3 that was trained for long-context reasoning. Whereas DeepSeek-V3 competes with GPT-4o, OpenAI’s conventional chat model, R1 competes with o1, OpenAI’s cutting-edge reasoning model.
Reasoning models like o1 or R1 generate “thinking tokens” as they work through a problem. OpenAI hides o1’s thinking tokens. In contrast, DeepSeek allows users to see the thinking tokens generated by R1.
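For the curious, DeepSeek exposes this through an OpenAI-compatible API: its documentation describes a deepseek-reasoner model that returns the chain of thought in a separate reasoning_content field alongside the final answer. A rough sketch of what that looks like (the model and field names are taken from DeepSeek’s docs as of this writing, so treat this as illustrative rather than a guaranteed-stable interface):

```python
# Sketch: reading R1's visible "thinking tokens" via DeepSeek's OpenAI-compatible API.
# Model and field names ("deepseek-reasoner", "reasoning_content") follow DeepSeek's
# public API documentation at the time of writing.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",   # DeepSeek's hosted R1 model, per its API docs
    messages=[{"role": "user", "content": "What is the 10th prime number?"}],
)

message = response.choices[0].message
print("THINKING:", message.reasoning_content)   # the chain of thought R1 exposes
print("ANSWER:", message.content)               # the final answer
```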
Once again, DeepSeek published an in-depth technical report. This report was particularly interesting to the AI community because OpenAI has revealed few details about how o1 works. So people were hungry to learn what was going on inside these reasoning models.
One surprise was how little DeepSeek changed the architecture of V3 to create R1. For example, some people have speculated that o1 uses exotic techniques such as a “tree of thoughts,” where the model’s reasoning “branches out” to explore many possible reasoning steps in parallel. But R1 (and probably o1) doesn’t do anything like that: it simply generates a single long string of “thinking tokens” as it works through a problem.
DeepSeek trained the R1 model using a technique called reinforcement learning—the same technique that OpenAI used to train o1. In reinforcement learning, the model tries to solve a difficult problem with a known right answer. If the model gets the right answer, its weights are updated to make it more likely to produce answers like that in the future. If it gets the wrong answer, weights are adjusted in the opposite direction.
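Here is a toy, runnable illustration of that update rule. A real reasoning model samples entire chains of thought, and DeepSeek’s actual algorithm is a group-based method called GRPO described in the R1 report; the sketch below just has a tiny “policy” pick one of four candidate answers. But the principle is the same: answers that turn out to be correct are made more likely, and wrong ones less likely.

```python
# Toy illustration of outcome-reward reinforcement learning (REINFORCE-style).
# Not DeepSeek's GRPO algorithm; just the basic idea of learning from whether
# the final answer matches a known ground truth.
import torch

logits = torch.zeros(4, requires_grad=True)   # a tiny "policy" over 4 candidate answers
optimizer = torch.optim.SGD([logits], lr=0.5)
CORRECT = 2                                   # index of the known right answer

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    answer = dist.sample()                    # the model "attempts" the problem
    reward = 1.0 if answer.item() == CORRECT else 0.0

    # REINFORCE update: push up the log-probability of rewarded answers.
    loss = -(reward - 0.25) * dist.log_prob(answer)   # 0.25 = chance-level baseline
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))           # probability mass concentrates on index 2
```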
When o1 was released, many people assumed that training it required a large number of detailed step-by-step solutions written by human experts. But DeepSeek found that R1 could largely teach itself by trying problems, checking the solution, and then learning from its mistakes.
“DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation,” DeepSeek writes in the R1 white paper. “Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously.”
This was exciting to AI researchers because it suggested there was a lot of room for models to make progress without help from human experts, in much the same way that Google’s AlphaZero famously taught itself to play Go.
The case for export controls
So AI industry insiders had a number of reasons to be excited about R1. But hype about R1 soon spread beyond the insular AI community. Like OpenAI, DeepSeek has a mobile app that makes it easy for anyone to use its models. Over the weekend, more than a million people downloaded the DeepSeek app, pushing it ahead of ChatGPT to become the most downloaded iOS app.
As buzz about DeepSeek grew, three facts caught the attention of policymakers:
The Biden administration established export controls to prevent Chinese companies from training frontier models.
DeepSeek is a Chinese company.
DeepSeek trained V3, the model underlying R1, for just $5.6 million.
Some people see DeepSeek’s success as evidence that Biden’s export controls were a mistake—or at least poorly implemented. DeepSeek trained V3 and R1 using the H800—a less powerful variant of the H100 that Nvidia could legally export to China.
But defenders of Biden’s export controls argue that the system is working as intended. They point out that the Biden administration subsequently tightened the rules to prevent further sales of the H800. And they predict that as Nvidia ships new, more powerful chips to American customers like Google and Microsoft, the gap between Western and Chinese companies will steadily grow.
Anthropic’s Dario Amodei defended the Biden approach in a Monday interview at Davos.
“2026, 2027 is the critical window,” Amodei told Economist editor in chief Zanny Minton Beddoes. “If you’re ahead then, the models start getting better than humans at everything, including AI design, including using AI to make better AI, including using AI to make all kinds of intelligence and defense technologies. So I think this is pretty important.”
Export controls may only give American companies a modest edge over their Chinese rivals. And that edge may only last for a few years. But Amodei believes that even a small advantage will be extremely valuable, because the country with the most powerful AI systems in 2027 could become the most powerful country for years to come.
Supporters also point out that the success of export controls is not binary. The controls did not prevent DeepSeek from training the R1 model, but it seems likely that DeepSeek would have trained an even more powerful model by now if the company had access to more and better chips.
The case for export control skepticism
These arguments make sense as far as they go, but I wonder if supporters of export controls are missing the forest for the trees.
The US government originally established export controls because officials were concerned that AI could be significant in a potential future military conflict with China. A key assumption here is that vast amounts of computing power will be required to train militarily significant AI systems. The low cost of DeepSeek V3 casts doubt on that thesis.
At this point, it simply doesn’t seem plausible to me that we can deny the Chinese military access to the chips they need to train militarily significant models. DeepSeek has demonstrated that with enough ingenuity and hard work, a company can dramatically reduce the number of GPUs required to achieve a given level of model performance.
Some supporters of export controls acknowledge that the military rationale doesn’t make sense.
“I always thought the Biden administration was a little disingenuous talking about ‘small yard, high fence’ and only defining it as military capabilities,” said China expert Jordan Schneider in a recent podcast. “That’s where the compute will go first, but if you’re talking about long-term strategic competition, much of how the Cold War was ultimately resolved came down to differential growth rates.”
In other words, the goal of the export controls isn’t to keep AI out of the hands of the Chinese military—that’s probably impossible. Rather, the goal is to essentially strangle China’s civilian AI industry and (as a result) slow down the Chinese economy as a whole. A smaller, less AI-savvy Chinese economy probably would make China a weaker adversary in the event of military conflict. But we might be creating a lot of ill will in the process. I’m not convinced it’s worth it.
DeepSeek’s success isn’t bad news for Nvidia
Conventional wisdom holds that Nvidia’s stock crashed because DeepSeek proved that AI progress won’t require larger models and more computing power. I don’t buy this at all.
As regular readers know, the trend toward ever-larger models seems to have run out of steam in 2024. Instead, the industry has pivoted toward a new post-training regime. Amodei explained this in the same Davos interview:
At all times in the past 99.9 percent of the compute went into one kind of training, which is pretraining. We're now executing this switchover where we figured out how to put small amounts of compute into this second stage—this reinforcement learning stage. And because none of it was being done before, there are big gains to that stage. The amount of compute in that stage is increasing to the point where it will even become dominant.
So while the amount of computing power required to train an AI model has plateaued over the last year, it might start to shoot back upwards in the coming months as frontier AI labs pour resources into the new reasoning paradigm.
But even if Amodei is wrong about this, demand for inference—that is, users actually using models—is likely to be very elastic. Over the last 60 years, the cost of computing power has plunged as people figured out how to pack more transistors on a computer chip. And during that same period, companies like Intel and Nvidia grew rapidly because demand for computing power went through the roof.
I expect something similar to happen with AI. It might require less computing power to generate any given token with a DeepSeek model. But lower costs will greatly increase the number of tokens people generate.
And this is especially true with the inference compute paradigm pioneered by OpenAI with o1 and adopted by DeepSeek for R1. Researchers have found these models produce better answers if they are allowed to spend more tokens “thinking.” As tokens get cheaper, people will be willing to generate many more tokens in order to produce better answers to their queries.
One of my favorite think tanks, the Institute for Progress, is looking to hire fellows and senior fellows in emerging technology. Potential research topics include how to maintain US leadership in AI, how to build state capacity to measure and respond to improvements in AI, and how to use AI to accelerate scientific progress. Full details are here. If you apply, please tell them I sent you.