It's the end of pretraining as we know it
LLM training techniques are evolving, and that has huge implications.
It’s Future of Transformers Week at Understanding AI! All week I’ll be examining the limitations of today’s LLMs and discussing possible solutions.
Last month I wrote about whether transformer-based foundation models were “running out of steam.” I noted a year-long trend where frontier labs have been releasing smaller models with surprisingly strong performance, but not bigger models that are dramatically better than the previous state of the art.
Anthropic, for example, released its mid-sized Claude 3.5 Sonnet model in June and said it would be releasing “Claude 3.5 Haiku and Claude 3.5 Opus later this year.” As promised, Anthropic released its small Claude 3.5 Haiku model in October along with an improved version of Claude 3.5 Sonnet. But Opus, the largest member of the Claude 3.5 family, is still missing in action. Anthropic’s model page used to describe Claude 3.5 Opus as coming “later this year,” but that language has disappeared.
The pattern has been similar at Google, which announced the mid-sized Gemini 1.5 Pro model in February and the small Gemini 1.5 Flash model in May. But Google hasn’t released a new large model since Gemini 1.0 Ultra, which was announced more than a year ago.
Last week Google announced the next generation of its Gemini models—Gemini 2.0. At least for now, only the smallest model, Gemini 2.0 Flash, is available to the public.
One likely reason for this trend is that small models have been exceeding expectations:
When Google announced the mid-sized Gemini 1.5 Pro, it said it performed at a “broadly similar level” to Gemini 1.0 Ultra, Google’s previous top model.
When Anthropic released the mid-sized Claude 3.5 Sonnet, it said it outperformed Anthropic’s previous top model, Claude 3.0 Opus, on a number of benchmarks.
Now Google says Gemini 2.0 Flash outperforms Gemini 1.5 Pro on key benchmarks.
At the same time, Bloomberg and The Information have reported that all three frontier labs have been disappointed in the results of recent large training runs.
According to Bloomberg, “an upcoming iteration of [Google’s] Gemini software is not living up to internal expectations,” while Anthropic “has seen the timetable slip for the release of its long-awaited Claude model called 3.5 Opus.”
Meanwhile, The Information reported that OpenAI has been training a model called Orion that was intended to be a successor to GPT-4. But while “Orion’s performance ended up exceeding that of prior models, the increase in quality was far smaller compared with the jump between GPT-3 and GPT-4.”
The mystery of Claude 3.5 Opus
Last week, SemiAnalysis published an article about this situation that I found particularly illuminating.
“Anthropic finished training Claude 3.5 Opus and it performed well, with it scaling appropriately,” the SemiAnalysis authors wrote. “Yet Anthropic didn’t release it. This is because instead of releasing publicly, Anthropic used Claude 3.5 Opus to generate synthetic data and for reward modeling to improve Claude 3.5 Sonnet significantly.”
A key thing to remember here is that larger models aren’t just more expensive to train, they’re also more expensive to use. Anthropic, for example, charges $75 per million output tokens for Claude 3.0 Opus, $15 for Claude 3.5 Sonnet, and just $4 for Claude 3.5 Haiku. So if a lab figures out how to transfer most of the capabilities from an Opus-sized model to a Sonnet- or Haiku-sized model, it has little reason to ship the larger model.
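To put those prices in perspective, here’s a quick back-of-the-envelope calculation. (The prices are the ones quoted above; the billion-token workload is an invented figure for illustration, not real usage data.)

```python
# Back-of-the-envelope serving costs at Anthropic's published output-token prices.
# The 1-billion-token workload is a hypothetical example, not a real usage figure.
prices_per_million_output_tokens = {
    "Claude 3.0 Opus": 75.00,
    "Claude 3.5 Sonnet": 15.00,
    "Claude 3.5 Haiku": 4.00,
}
tokens_served = 1_000_000_000  # hypothetical monthly output volume

for model, price in prices_per_million_output_tokens.items():
    cost = price * tokens_served / 1_000_000
    print(f"{model}: ${cost:,.0f}")
# Claude 3.0 Opus: $75,000
# Claude 3.5 Sonnet: $15,000
# Claude 3.5 Haiku: $4,000
```

Serving the same traffic on Sonnet instead of Opus cuts the bill by a factor of five, which is why a lab that can close most of the quality gap has a strong incentive to do so.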
The SemiAnalysis authors hinted that the story is similar with OpenAI’s Orion model. Rather than releasing Orion to the public, OpenAI is using it to generate training data for—and judge the output of—smaller models.
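For readers unfamiliar with the technique, here’s a minimal sketch of what “using a big model to improve a small one” looks like in practice. Everything here (the function names, the filtering threshold) is hypothetical; it illustrates the general synthetic-data-plus-reward-model pattern, not any lab’s actual pipeline.

```python
# A minimal sketch of distillation via synthetic data: a large "teacher" model
# generates candidate answers, a teacher-based reward model filters them, and
# the surviving pairs become fine-tuning data for a smaller "student" model.
# The helper functions are hypothetical stand-ins for internal lab APIs.

def generate_with_teacher(prompt: str) -> str:
    """Hypothetical call to the large teacher model (an Opus- or Orion-class model)."""
    raise NotImplementedError("wire up to a real model API")

def score_with_reward_model(prompt: str, answer: str) -> float:
    """Hypothetical reward-model score in [0, 1], judging answer quality."""
    raise NotImplementedError("wire up to a real reward model")

def build_synthetic_dataset(prompts: list[str], keep_threshold: float = 0.8) -> list[dict]:
    """Keep only the teacher answers that the reward model rates highly."""
    dataset = []
    for prompt in prompts:
        answer = generate_with_teacher(prompt)
        if score_with_reward_model(prompt, answer) >= keep_threshold:
            dataset.append({"prompt": prompt, "completion": answer})
    return dataset
```

The student model is then fine-tuned on the filtered prompt/completion pairs, inheriting much of the teacher’s behavior at a fraction of the inference cost.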
SemiAnalysis urged readers to “ignore the scaling deniers”—presumably including reporters at Bloomberg and The Information—who claim that Claude 3.5 Opus didn’t perform well. According to SemiAnalysis, “this is FUD.”
I have a lot of respect for SemiAnalysis founder Dylan Patel and his deeply knowledgeable team. But it seems to me that their analysis comes closer to confirming the reporting of Bloomberg and The Information than refuting it. Everyone seems to agree that Anthropic trained Claude 3.5 Opus and then decided not to release it. The only disagreement is about why.
Bloomberg and The Information reported that it was because the model’s performance was disappointing. SemiAnalysis said it was because Anthropic wanted to use the model to improve smaller ones. But this seems like a why-not-both situation.
After all, Claude 3.5 Sonnet was impressive for its size, but it didn’t deliver the kind of performance breakthrough people expect from a GPT-5-class model. If Anthropic (or OpenAI) had a large model with dramatically better performance, they’d release it even if it required a much higher price to make it profitable.
OpenAI just announced a $200-per-month ChatGPT Pro tier for the most powerful version of o1. Clearly they believe customers are willing to pay a premium for sufficiently powerful models. The fact that OpenAI didn’t do this with Orion, and Anthropic didn’t do it for Claude 3.5 Opus, strongly suggests that the performance of these models isn’t that amazing.
“Pretraining as we know it will unquestionably end”
When I started this newsletter in March 2023, about two weeks after OpenAI released GPT-4, it was widely assumed that OpenAI would continue pursuing the scaling strategy that had led to GPT-4. Many people assumed that within the next couple of years, OpenAI would train a new, much larger model, likely to be called GPT-5, that would be dramatically more capable than GPT-4.
In 2023 and early 2024, there was a lot of discussion about whether the industry would “run out of data” to use in training. There were a range of views on this question, but most people thought there were enough untapped sources of data to support at least one and probably two more generations of scaling beyond GPT-4.
For example, last December podcaster Dwarkesh Patel asked Ilya Sutskever, who was then the chief scientist at OpenAI, if OpenAI was close to running out of data.
“I would say the data situation is still quite good, there is still lots to go,” Sutskever said. “I think you can still go very far in text only, but going multimodal seems like a very fruitful direction.”
What a difference a year makes.