Are transformer-based foundation models running out of steam?
Leading labs are reportedly struggling to improve LLM performance with scaling.
When OpenAI released GPT-4 in March 2023, it helped to cement the conventional wisdom about “scaling laws.” GPT-4 was about 10 times larger than the model that powered the original ChatGPT, and its larger size yielded a significant jump in performance. It was widely assumed that OpenAI would soon release GPT-5, an even larger model that would deliver another big jump in performance.
But 18 months later, OpenAI hasn’t released GPT-5, and CEO Sam Altman says that no model called GPT-5 will come out this year. OpenAI has released several other models with impressive capabilities, including GPT-4o in May and o1 in September. OpenAI hasn’t revealed the size of these models, but it is widely believed that they aren’t much larger than the original GPT-4—and might even be smaller.
The story has been similar at other leading AI labs. A few months ago both Google and Anthropic updated their small and medium-sized models (Sonnet 3.5 and Haiku 3.5 for Anthropic, Pro 1.5 and Flash 1.5 for Google). But we’re still waiting for corresponding updates to their largest models (Opus 3.5 for Anthropic and Ultra 1.5 for Google).
These trends have caused a lot of people, including me, to wonder whether scaling laws are running out of steam. And in the last week, a series of news reports have provided fresh support for that thesis.
It started with The Information, which reported on Saturday that OpenAI has been disappointed with the performance of its next major model, code-named Orion: “While Orion’s performance ended up exceeding that of prior models, the increase in quality was far smaller compared with the jump between GPT-3 and GPT-4.”
On Monday, Reuters reported that “researchers at major AI labs have been running into delays and disappointing outcomes in the race to release a large language model that outperforms OpenAI’s GPT-4.”
Ilya Sutskever, an OpenAI co-founder who left the company in May, told Reuters that “the 2010s were the age of scaling. Now we're back in the age of wonder and discovery once again.”
A Wednesday story from Bloomberg confirmed some of the details previously reported by The Information: OpenAI has been training a model called Orion that performs better than GPT-4. But the performance gain was less than OpenAI engineers expected. Meanwhile, at Google, “an upcoming iteration of its Gemini software is not living up to internal expectations.” And Anthropic “has seen the timetable slip for the release of its long-awaited Claude model called 3.5 Opus.”
Also Wednesday, The Information said that “Google has recently struggled to achieve performance gains in its Gemini conversational artificial intelligence at the same rate it did last year.”
Industry insiders still believe in scaling, but I’m skeptical
As longtime readers know, I’ve suspected that something like this would happen for close to a year. Here’s how I put it in my December writeup of Google’s Gemini 1.0 announcement:
Google’s own benchmarks show the forthcoming Gemini Ultra achieving only an incremental improvement over GPT-4.
That’s somewhat surprising because if any company can give OpenAI a run for its money, it should be Google. The big question for me is whether Google fumbled the ball in some way, or whether the classic LLM architecture is starting to run out of steam.
My guess—and at this point it’s only a guess—is that we’re starting to see diminishing returns to scaling up conventional LLMs. Further progress may require significant changes to enable them to better handle long, complex inputs and to reason more effectively about abstract concepts.
I think this take has held up well over the last 11 months. And while I don’t know what’s going to happen next year, my suspicion is that the labs will continue to be disappointed as they scale up transformer-based models.