Nathan Lambert on the rise of "thinking" language models

I'm launching a new podcast! Please subscribe in your favorite podcasting app.

As I announced in yesterday’s post, I’m launching a new podcast called AI Summer. Every week Dean Ball and I will talk to leading experts about the future of AI technology and policy.

I’m cross-posting the first two episodes here at Understanding AI, but most of the conversations will not be cross-posted. So if you want to listen to them, you’ll need to subscribe to the podcast directly. There are two ways to do this:

  • To get episodes in your email inbox, visit www.aisummer.org and enter your email address.

  • Open your favorite podcast app and search for “AI Summer.”

Nathan Lambert is the author of the popular AI newsletter Interconnects. He is also a research scientist who leads post-training at the Allen Institute for Artificial Intelligence, a research organization funded by the estate of Paul Allen. This means that the organization can afford to train its own models—and it’s one of the only such organizations committed to doing so in an open manner. So Lambert is one of the few people with hands-on experience building cutting-edge LLMs who can talk freely about his work. In this December 17 conversation, Lambert walked us through the steps required to train a modern model and explained how the process is evolving. Note that this conversation was recorded before OpenAI announced its new o3 model later in the month.

Timothy B. Lee: Today we want to talk about a topic in AI that's become increasingly important over the last couple of years, which is something called post-training. And recently at the NeurIPS conference a very famous ML researcher named Ilya Sutskever talked about—Dean, actually I think you watched this more recently than me—tell me what Ilya said.

Dean Ball: So Ilya had a keynote talk at the Neural Information Processing Systems conference, which is the biggest academic machine learning conference. It has exploded in popularity. And Ilya gave a lecture to a rapt audience in which he said that, we're nearing, or have approached, have passed, the end of the pretraining era of AI.

Pretraining is, when you think of training an AI model, the big compute clusters. Predicting the next word. Predicting trillions of tokens from internet text. That's pretraining. But he said that is coming to an end because we've run out of internet data.

And so now we're in the era of post-training, which is the things you do after you have that giant internet text prediction model. How do you refine that into something that's useful to people in various different ways? Post-training's been at the heart of a lot of recent advancements in especially language models.

But it's also a mystery. Because the pretraining thing is easy enough to explain to a lay person of, okay, predict every single next word of Wikipedia and everything else on the internet. And then from that, you get something that's pretty good at predicting the next word of a sequence of texts.

Okay. But post-training's more mysterious. It's more finicky. And a lot of it is just kind of secretive I've found. A lot of the details are just not really known.

Timothy B. Lee: And there's also a lot of different ways to do it. And so we have the perfect person to help demystify some of this.

Nathan Lambert leads post-training at the Allen Institute for Artificial Intelligence, which is a nonprofit organization that I believe is funded by the estate of Paul Allen. They are one of the few organizations that have the resources to do some non-trivial creation and training and post-training of language models, and they are very committed to doing this in an open manner. Nathan has been working on some of the same processes that you would find at OpenAI or Anthropic or Meta, but he's able to talk about it in much more detail. Nathan, welcome to the AI Summer Podcast.

Nathan Lambert: Thanks for having me. I have a spiel that will complicate things to start. I think the technicality is that Ilya said pretraining as we know it is ending: it's fast to scale on the internet, but we only have one internet and we've mostly used it up.

But mostly that just means that the models are going to be more specialized as to what you need them to do, and the interface between pre- and post-training is going to need to be more synergistic. I think we've seen this with OpenAI's o1 models and how they complicate the two by saying that they do a ton of large-scale post-training techniques, which is kind of a contradiction within the words. But it's true that the idea of post-training, and the idea of using different loss functions to train language models, is evolving very fast and is a large driver of the new types of behaviors and performance that people see on a daily basis.

Timothy B. Lee: So AI2 recently released a new model and you did a really great article breaking down some of the techniques. Can you just give us the quick version of that? Break it down for us: did you do pretraining, or did you start with an existing model?

Nathan Lambert: Yeah. So largely in the post-training world. The thing that we released is a project that we called Tulu 3, which is a recipe for post-training. We apply this both to Llama models, where we do a lot of experimentation, and then to our internal OLMo models, which are the fully open source models that we train at the Allen Institute.

And largely post-training recipes can be transferred between models, especially when they have similar characteristics. You end up needing to tweak the data a little bit and some other hyperparameters, but if you're training two models within a month, things aren't going to change that much.

Or if you have something that works, it'll work on other ones.

Raw pretrained models are weird

Timothy B. Lee: So actually, before you break down the post-training steps, talk to me a little about if you just take a kind of raw pretrained model that's been trained to predict the next word, but hasn't had any of this fit and finish. What's that model like to use? What is it good at? What is it not good at? Yeah, what's the user experience there?

Nathan Lambert: Essentially, these models are mostly autocomplete models. I think as post-training has become more important, people have started adding what we call instruction data to pretraining, which is the idea of asking a question and then getting a response.

So especially some of the best models, like if you look at Llama 405B and these very big latest base models, you could do some question answering with them. But especially if you go back a year and you look at a base model, it feels like autocomplete; it doesn't necessarily do a good job of, say, stopping generating. I think if you go into the old APIs of AI companies, they used to have a completions API, which is now all just chat APIs. So that kind of base language model used to be much more of a thing in AI, but it is good to play with, to learn what you are starting with before you go through all these transformations, and it is very different.

Timothy B. Lee: So if you ask it a question, it might respond with another question, or it may treat that as the first line of a dialogue, or there are lots of different directions it can go other than I'm an assistant and I'm trying to answer the question you asked me, right?

Nathan Lambert: Yeah, I'd say largely it's much more unpredictable. So it's harder to give examples other than thinking of it as an autocomplete.

Dean Ball: Yeah, it's bizarre, right? I remember back in 2019 or 2020 I tried to use GPT-2 for policy research. I was trying to assess how strict municipal zoning codes were. And so you input some sections from a statute and then like the way you prompt the model is “this statute's language is…”, or like “once upon a time, a brave hero…” That's like how you would do this. But then also, they kind of drift over time.

Like inevitably a base model, like it might start out helpfully answering your question, but then over time it often will drift into being like vaguely sounding like a Reddit commentator or something like that. They're just bizarre.

The three stages of post-training

Timothy B. Lee: All right, so you've got this raw model that knows a lot of stuff, but isn't housebroken, it won't behave in predictable ways. So then walk me through, you just finished developing a post-training recipe. What are the steps that turn this into the kind of chatbots we're all used to now?

Nathan Lambert: The art of this is that there are many different tools in your toolbox, and some of them are things that everyone uses, but the amount that people use different tools varies substantially, especially if you look at leading AI labs versus people replicating what they're doing.

So I think thinking of it as a tool and then commonalities is the best way to do it. And really the process starts by deciding what you want to get out of the model. I think this is a nuanced point that a lot of education on this won't explain, which is these people sit down and they're like, “these are the evaluations we care about. These are the behaviors we really need to maximize. These are behaviors that we don't care about.”

And then you go and you get a bunch of user queries. Or you make user queries that correspond to that. So if you care about math, you'll have example math problems. And you're like, these are like problems and questions that you train on.

And then you end up doing a lot of different things with these queries. And I would summarize the kind of three stages of what people do as, first, instruction tuning, which is the oldest technique. It's actually using the same loss function as pretraining, but making the model format things in this question-and-answer way that people are now used to rather than this autocomplete form. And there are some things behind the scenes where essentially you change the tokenizer and you change how information is processed very slightly so that it's structured in a question-and-answer format.

And then there's what would be called preference tuning, which is a generalized concept from reinforcement learning from human feedback, where A/B testing ideas and collecting human preferences come in. And that is really good for making the models follow the different styles humans prefer. So things like markdown lists and bolded headings and just consistent response length and some consistent themes in answers come up there. And that used to be what most of post-training was described as: instruction tuning and reinforcement learning from human feedback.

And in the past year especially, we've seen OpenAI release o1, we've seen OpenAI release an RL fine-tuning API, and we've seen a lot of research on using other loss functions to further improve performance on these tasks that we started with, the ones we really care about. So I think OpenAI's fine-tuning API, called reinforcement fine-tuning, is actually the best generalization of this, which is using an optimizer called reinforcement learning, which is a big area of study in AI on its own. But we're specifically taking the optimizers that have been developed there and applying them to language models to run on verifiable domains like math and code and kind of reinforce very specific behaviors and capabilities into the models.

I think there's a lot of nuance in what is actually learned at these different stages, and I give hints along the way, but these are the core tools and the normal order is the order that I went in, which is instruction tuning, preference tuning, and then reinforcement tuning. But there's a lot of interweaving of things at frontier model labs where they have different constraints on when they get the data. They have multiple teams working on these and they need to be able to fork progress and then bring it back together.

At the Allen Institute, we had a small team, so we did these sequentially one step at a time. It's easy to abstract, understand what's going on. But in reality, there's multiple hundreds of people that work on just this process of training a language model.

Instruction tuning explained

Timothy B. Lee: So we have those three steps. Let's just walk through them step by step. So instruction tuning, what I think I heard you say is that it's largely the same process as pretraining.

You're still predicting the next word, but you have a new training set that's in the sort of question and answer format that you're expecting, that you're hoping the model to do. And you're just saying, okay, you know all this stuff, but now the format of output we want is the user asks a question and you give a helpful response. And it's just more training on those kinds of chat transcripts? Is that right?

Nathan Lambert: Yeah, and it will really increase the kind of prevalence of features in that data. So if your instructions have certain ways that the language model talks, if it says certain things—“sure, I'm happy to help”—if it says that a lot the model will learn to keep repeating these features.

At the same time, an impressive amount of performance in post-training comes just from doing this change of format, mostly because the evaluations and behaviors that people care about from these models is in this format.

So if you look at a comparison between a base model and what is called an instruction tuned or a supervised fine tuned model, a very large amount of performance gain just comes from doing this, which is partially from learning the format and partially from this new data you've added in.

It could be very stark in terms of just how immediate the gains are.
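
To make the instruction-tuning idea concrete, here is a minimal sketch in PyTorch of the loss involved: the same next-token prediction objective as pretraining, applied to a chat-formatted example, with the prompt tokens masked out so only the assistant's reply contributes to the loss. The chat-template strings and toy tensors are illustrative, not AI2's actual recipe.

```python
import torch
import torch.nn.functional as F

# Hypothetical chat template: special strings marking user and assistant turns.
# Real models bake template tokens into the tokenizer; these are illustrative.
def format_example(question: str, answer: str) -> str:
    return f"<|user|>\n{question}\n<|assistant|>\n{answer}<|end|>"

def sft_loss(logits: torch.Tensor, token_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Next-token prediction, identical to the pretraining loss, but with the
    prompt tokens masked out so only the assistant's answer is learned."""
    shifted_logits = logits[:, :-1, :]       # position t predicts token t+1
    targets = token_ids[:, 1:].clone()
    targets[:, : prompt_len - 1] = -100      # ignore loss on the prompt portion
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )

# Toy usage with random "model" outputs, just to show the format and shapes.
print(format_example("What is 2+2?", "4"))
vocab_size, seq_len, prompt_len = 100, 12, 5
logits = torch.randn(1, seq_len, vocab_size)            # would come from the model
token_ids = torch.randint(0, vocab_size, (1, seq_len))  # tokenized chat-formatted example
print(sft_loss(logits, token_ids, prompt_len))
```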

Timothy B. Lee: And where did you guys get the data for that? Is there an open data set of instruction tuned training data, or did you hire somebody to do it?

Nathan Lambert: There's a large mix right now. In the open, there are a few human generated datasets with various licenses. Some of them are from community projects where they sit down and people write questions and answers and filter them themselves. There's limited data from AI data vendors that is available online, normally under more restrictive licenses.

And most of it is synthetic data from a bigger language model. You have some prompt. There are a few datasets that have human prompts, which is very valuable to the research community. And you take those prompts and you prompt a model like GPT-4 or the best Llama model, and it generates a completion.

And those completions tend to actually be very competitive with any human completion in terms of just quality and accuracy, given the top end of language models that we have. So even closed labs have said that they're shifting their instruction completions to be more from AI relative to humans, whereas one, and especially two, years ago most of these instructions were written by humans.

Timothy B. Lee: And for you guys, I assume you use an open model like Llama to do that portion of it.

Nathan Lambert: We actually use OpenAI's API.

Timothy B. Lee: Okay, and they allow that?

Nathan Lambert: It's a gray area in the terms of service, but we are not trying to build competitors to OpenAI's products. We are trying to build research methods in post-training.

The evolution of preference tuning

Timothy B. Lee: Okay, so then that's the first step. Then the second step: when I started paying attention to this 12 to 18 months ago, I heard a lot about reinforcement learning from human feedback. That was the original version of this that labs used, right?

Nathan Lambert: Reinforcement learning from human feedback is still definitely used in these labs. I think there's an integration of kind of AI feedback on these preferences and the big reason why I've generalized to preference fine tuning rather than reinforcement learning from human feedback is an emergence of a more direct way of doing this, which is from something called direct alignment algorithms.

The first and most popular is something called direct preference optimization, which uses the same underlying data, but doesn't actually use the reinforcement learning optimizer. So the optimizer is just what takes the gradient steps and actually updates the weights of the models. So there's a different way to do this with some interesting math, and it doesn't technically use RL. So generalizing to preference fine-tuning: you compare two answers, you compare A to B, and you decide that one is better than the other, and that's the human preference.

So it is more general in that domain, but it's less of the overall picture in people's minds where RLHF is this really big term and it still is a big term, but it means something very different.
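
For readers who want to see what the "interesting math" behind direct preference optimization looks like, this is a minimal sketch of the DPO loss (Rafailov et al., 2023): it rewards the trained model for making the chosen response more likely than the rejected one, relative to a frozen reference model, with no explicit RL optimizer. The numbers below are toy values for illustration only.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct preference optimization loss. Inputs are summed log-probabilities
    of the chosen and rejected completions under the model being trained
    ("policy") and under a frozen reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the implicit reward margin (chosen minus rejected) to be positive.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy example: log-probabilities for a batch of two preference pairs.
policy_chosen = torch.tensor([-12.3, -8.1])
policy_rejected = torch.tensor([-11.9, -9.4])
ref_chosen = torch.tensor([-12.5, -8.3])
ref_rejected = torch.tensor([-11.8, -9.2])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```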

Timothy B. Lee: So you've got a model and a prompt, you ask your not completely post-trained model to produce two outputs and then you ask a human being, which of these outputs is better.

Nathan Lambert: Yeah. Or, in our case, we use AI models to generate these comparisons, which is a very open research question on exactly how the behavior of the models changes depending on whether you're using a human or an AI as a judge.

In terms of basic capabilities, things you would see on an announcement with math or code, normally language models can perform very similarly as judges compared to humans. But I know that there's going to be research that figures out what types of biases and noise are different in humans versus language models. So it's almost too convenient that language models as a judge just still work at the paper level and the high-level sniff test. But there are definitely differences that the research field hasn't fully documented yet.

Dean Ball: And it's just worth pointing out that reinforcement learning from human feedback is the innovation that enabled ChatGPT, right? GPT-3 in 2020 to GPT-3.5, which was the model underlying ChatGPT in November of 2022. There were, I'm sure, a million differences between those models, but the big one was RLHF. They had figured out how to get this human preference data and create a model that is more aligned with human preferences.

I just think that's worth pointing out because my observation has always been that reinforcement learning and post-training have been at the heart of consumer AI advances, at least in the language modeling world, since consumers became aware of what language models even are.

Nathan Lambert: I do this regularly. I quote the original ChatGPT blog post when I'm talking about post-training. It's not very long. One of the first sentences is that we leverage RLHF, reinforcement learning from human feedback, to do this. And it also leads into the fact that a difference between academic and industry post-training is that user interface design is a very large part of the end of post-training in these labs. How people use and experience the models is deeply influential on how model behavior is perceived and how these models actually work.

So that's something that's hard to do academic research on, because you need to have so many users and you do really subtle A/B testing. Where the text box is and how fast the tokens come back, these affect how people perceive the models. There's going to be a ton of research on this.

Just like Google has mastered the search bar—that can be encompassed in post-training. When I talk to friends and at these frontier labs, they describe that as part of the post-training purview.

How RLHF works

Timothy B. Lee: Okay. So I'd like you to explain this to me like I'm five. So I've got a model and I've got two generations from that model or two example responses. And I've got a rating that says this one's better than that one.

Walk me through what one of these, either RLHF or DPO or one of the others that you mentioned, actually does. Yeah. So this is still an intuition that I think is not something that many people communicate.

Nathan Lambert: This is something that I would ask someone like John Schulman, one of the people behind RLHF who is on the paper credited with creating ChatGPT. This type of comparison loss function is not super simple in terms of what it changes in the model. On a mathematical level, it's what would be called a contrastive loss function.

So you're updating the model based on the difference between multiple completions or multiple pieces of text. When you're doing pretraining, it's very simple: you have to predict the next word accurately. And in supervised fine tuning, it's very similar. You're predicting the next word accurately, but on a different format of text.

But what is happening in RLHF or preference tuning is credit assignment over the full completion at once. There are kind of two ways to do this: you either do this with reinforcement learning, which assigns a scalar reward to each of these pieces of text, or, if you're using a direct alignment algorithm, you essentially assign a win or a loss, which you could think of as a reward of one and a reward of zero. So you're implicitly doing these weightings.

It has this numerical value for the whole piece of text, which is a prompt and a response. It can be multiple turns. It's just for one piece of the conversation. And the model is increasing the probability of the chosen response occurring and decreasing the probability of the rejected response.
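
In the classic RLHF path, the scalar reward Nathan mentions comes from a separate reward model trained on the same A/B comparisons with a Bradley-Terry-style pairwise loss. A minimal sketch, with made-up scores standing in for the reward model's outputs:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss for training a reward model from preferences:
    maximize the probability that the chosen completion scores above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scalar scores the reward model assigned to two (prompt, completion) pairs each.
chosen_scores = torch.tensor([1.7, 0.2])
rejected_scores = torch.tensor([0.9, 0.5])
print(reward_model_loss(chosen_scores, rejected_scores))
```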

There's a lot of mathematical messiness because you can think about this margin. You could actually just decrease the probability of the rejected and that would increase the margin, or you could just increase the probability of the winning, the chosen, and that would also increase the margin, and it's affecting all of these tokens at once.

So it is much less clear on exactly how you would describe the behavior changing, but it could be inspired by something like behavior cloning where you're trying to get much closer to the responses that the human raters choose, but in practice, this is over large scales of data. It's over like a million data points.

So it's a lot harder to say exactly what specific thing would change, but it works in practice: if you consistently have people choosing a specific format or a certain voice for the model, that can then be amplified and ingrained into these parameters.

People talk about Claude having a very specific voice and they've given hints over the years that they do this sort of preference tuning to make it so that Claude has a very specific character. And that's beyond the principles that they say, like the human rights principles and constitutional AI, they do more things in this preference tuning to choose these chosen responses that are corresponding to specific behaviors that they really like in Claude.

And that's how Claude has a very different type of response quality than ChatGPT. I think Claude is much more inquisitive. It's more engaging. And these are things that they're reinforcing in this preference tuning phase.

Timothy B. Lee: So let me see if I can understand this at a more granular level. So you said you've got the score for the whole sequence. But then are you doing some kind of training on the individual tokens based on that?

Nathan Lambert: It updates the probability of choosing every token in the sequence. So the weight is assigned to the conversation turn, but the attribution per token depends on the token's probability itself and the magnitude of score assigned to it.

So yes, it is updating all of the tokens.

Timothy B. Lee: So for each token, it says this is a token in the good one, so we want to make it more likely to produce that token if we ran the generation again, and then it does backpropagation for that token. It does that for every single token in the good one, and then the opposite for every single token in the bad one.

Nathan Lambert: Yeah. And technically it doesn't necessarily need to be the same change per token, because it depends on the algorithm. Some algorithms will do different types of per-token attribution. It depends on the current state of the model that you're training, where it could be that a really rare token turns out to be really important.

And the delta there could be much higher than like filler tokens, like the, or a, or periods or something. So that's why I say it's much messier to describe because it changes all these tokens at once rather than just one token at a time in the prediction loss function.
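
A toy REINFORCE-style example of the per-token attribution being described: one scalar reward for the whole completion, but the parameter updates still differ token by token because the gradient flows through each token's current probability. This is a sketch of the general mechanism, not any lab's actual algorithm.

```python
import torch
import torch.nn.functional as F

# Toy setup: logits the model produced for 5 generated tokens over a vocab of 10.
torch.manual_seed(0)
logits = torch.randn(5, 10, requires_grad=True)
sampled = torch.tensor([3, 1, 7, 0, 4])   # the tokens that were actually sampled

# One scalar reward for the whole completion, applied to every token's log-prob.
sequence_reward, baseline = 1.0, 0.3
advantage = sequence_reward - baseline
per_token_logps = F.log_softmax(logits, dim=-1)[torch.arange(5), sampled]
loss = -(advantage * per_token_logps).sum()
loss.backward()

# The update still differs per token: it depends on how probable each sampled
# token already was, which is the per-token attribution described above.
print(logits.grad.abs().sum(dim=-1))
```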

Timothy B. Lee: And so how should people think about the difference between RLHF and DPO, and there are some other ones too? Is it basically just that one's more efficient or something, or do they give you different behaviors?

Nathan Lambert: They work at different levels of the model. Instruction tuning can help with a lot of things: it can help with general vibe or chatting, it can help with math, it can help with coding, because it is this foundational stage where everything just improves. Preference tuning can help with things like math and code, but in a more general sense it just helps with how much people like using this model.

You could take two models with dramatically different intelligence as measured by coding skills, math skills, reasoning skills, and in terms of general A/B testing, like which one do people like more on general queries, asking about history or whatever, if you do a better preference tuning on the less intelligent model, you could definitely have people prefer both of the models equally. On chatbot arena, you can see that there are less intelligent models that match much more intelligent models by just really argmaxing on this preference tuning and really focusing on style and getting consistent responses.

So this is normally during preference tuning where you can really push this chattiness or get this chattiness exactly where you want it in preference tuning, which is different than the other stages.

And then we go to this reinforcement fine-tuning stage, which tends to be focused on specific capabilities again. So really zooming in on things that you can verify, like math or code, and then it'll improve those capabilities, but it doesn't really shift the chattiness or this kind of underlying foundation of the model in areas other than what you're training on.

The rise of reinforcement tuning

Dean Ball: So that's reinforcement learning from human feedback. There is of course also this other school of post-training that we've seen more recently, as you mentioned, in OpenAI's o1 models; probably all of the frontier labs are working on similar kinds of approaches. And a sort of simple way of thinking about this is to look at AlphaGo from many years ago, DeepMind's AI system that could play Go. They trained it using reinforcement learning and eventually, with AlphaZero, entirely synthetic data: games that the system played against itself.

And with the reinforcement learning algorithm, they were able to train AlphaGo to be superhuman at Go, a heretofore impossibly complex game for AI systems, and to beat the best human Go players in the world. And what people have talked about for a long time—at least since ChatGPT, but also really long before that in the research community—is what if you had AlphaGo for everything? What if you had a system that reasoned its way to the best possible token, just like AlphaGo found the best possible move in Go?

You've written quite a lot on trying to figure out how the o1 models work. How do they work and how close are they to this kind of stylized ideal of an AlphaGo-like reasoner?

Nathan Lambert: There's kind of two questions. One is what is the generalized thing that people should know about post-training and this reinforcement learning idea? And two, what is o1?

I think the first question is actually much more important for people to know, which is like, what is reinforcement learning and what is it actually doing on language models? And why is it just showing up now? So reinforcement learning is ultimately trial and error learning.

It goes back to sequential decision making—a lot of history with agents. That idea of trying a problem many times and continuing to learn from experience hasn't been applied that much in language models. Reinforcement learning from human feedback is confusing in its name because it's using the reinforcement learning optimizer, but not the full suite of what reinforcement learning was designed to do.

So reinforcement learning is really about sequential decision making, trial and error. I keep repeating these words because it's important. And OpenAI actually made this a lot clearer to communicate when they announced their reinforcement fine-tuning API. You can watch the video. You can read my blog post on this.

But what they described and what we described in this Tulu 3 project is the idea that we're giving the model many chances to see the same problem and to generate different solutions. And you use a way of learning from that to let the model get better at performing on the same question or the same type of questions over multiple iterations of this question. And this is what OpenAI is describing in their reinforcement fine tuning API. We call this reinforcement learning with verifiable rewards in our paper.

And the really important thing is that you need a way to check the answer; in reinforcement learning historically, that would be called the reward function. It's the thing that the agent is optimizing against. And this is why it's very different from the other stages of post-training: the model is seeing these questions multiple times and just slowly improving its behaviors and how it approaches questions of this type.
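
A minimal sketch of the "reinforcement learning with verifiable rewards" setup being described: a reward function that simply checks the final answer, and a loop that samples several attempts per prompt for an RL optimizer to learn from. The answer-extraction pattern and the `sample` function are stand-ins for illustration, not the Tulu 3 implementation.

```python
import re

def math_reward(completion: str, gold_answer: str) -> float:
    """Verifiable reward: 1.0 if the stated final answer matches the reference,
    else 0.0. The extraction pattern is illustrative; real recipes parse answers
    much more carefully."""
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", completion.lower())
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == gold_answer else 0.0

def collect_rlvr_batch(sample, prompt: str, gold_answer: str, n_attempts: int = 8):
    """Sample many attempts at the same problem and score each one; the scored
    attempts are what an RL optimizer (e.g. PPO) would then learn from.
    `sample` stands in for generation from the current model."""
    attempts = [sample(prompt) for _ in range(n_attempts)]
    return [(a, math_reward(a, gold_answer)) for a in attempts]

print(math_reward("Let me check: 12 + 30 = 42, so the answer is 42.", "42"))  # 1.0
```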

OpenAI has confirmed that this reinforcement fine tuning API is built on the same infrastructure used to train their o1 models, and the o1 blog posts describe everything as large-scale reinforcement learning. So o1 is the idea of what happens to a language model if you have a ton of these problems where it can try and try again to keep learning new behaviors on a wide variety of domains.

When you think about traditional post-training, reinforcement fine tuning will normally be for just a few behaviors that a lab cares about, or an application cares about.

If you just want to get really good at math, you'll do reinforcement fine-tuning, but o1 is the general machine learning approach to what happens if an AI model is taught to reason repetitively across every domain that we can find. Every math domain, every science domain. I'm sure they will add logical reasoning, though I guess logical reasoning as connected to math is already in there.

What does it mean for a language model to just get new behaviors from trying many times? And the core of o1 is this scaling up of reinforcement learning training. Dylan Patel at SemiAnalysis had a post where he reported that they had about 10 million problems in this post-training set for o1.

And I think o1 is this shifting point between pre- and post-training. So the only real reason why it's called the post-training era is because these traditionally post-training loss functions—so RLHF, RL with verifiable rewards and everything—are just used with a lot more compute. At a realistic level, it's that pretraining as we know it has ended and pretraining is now different. Pretraining is about scaling up a more diverse set of data sources, more diverse loss functions. And that's what o1 and Ilya's talk kind of all point back to.

Dean Ball: And this is why I'm especially excited to be talking to you because your sort of background is reinforcement learning, right? That's where you specialize. And it is often said that RLHF is not really reinforcement learning in the way that you're talking about whereas what's going on with that o1 really is.

And it’s also, you know, RLHF was maybe 1 percent of the compute budget—something like that—of a frontier model, of Claude 3.5 Sonnet. But I’d be willing to guess that reinforcement learning is—you know, it’s not that pretraining is going away, there was a pretrained model that led to o1, of course. But maybe [reinforcement learning is] 10 or 20 percent of the compute budget, instead of 1 percent of the compute budget.

The theory that OpenAI had is: we can make this thing good at math if we have a bunch of math problems that have been solved out, step by step, by humans, and we use that to generate a reward signal with correct answers. And the thing can learn to find the correct answers to math.

It's amazing that works for math. But I think there was also a hope that that superior reasoning ability would somehow transfer over to other natural language domains that are varying degrees of fuzzy, right? So maybe we don't think it's going to be able to write a better analysis of a novel because it has this capability.

Maybe we do. But maybe it can write a better legal brief because legal reasoning certainly has some structural similarities to mathematical or coding reasoning. What do you think about that thesis, Nathan? Like that thesis that, this can be used to generate general reasoners. Is what you're learning reasoning itself?

Or are you learning to memorize traces of reasoning? Or is there a difference? Is there not a difference between those things?

Nathan Lambert: It does both, but at different times in the sequence. I think to start, the models need some of this human data to establish the formatting, but much less than people would expect.

And a lot of the compute on these things is just the model generating new attempts at problems. So when you think about it, that is definitely exploring some new spaces of reasoning of learning or whatever you want to call it. And I would actually bet that these behaviors will emerge in many different domains and that there's a big exposure risk to these companies to have a model that will reason about anything.

So there's probably some control to have this kind of on-off behavior on what it really comments on, what it shows its reasoning steps for, because there are many examples you could come up with where you don't want a model to have a lengthy argument on certain topics, like comparing different races.

But there are a lot of reasons why you don't want to have the model just do this for everything when you're serving a business. In the broad post-training literature, there are a lot of weird things where different post-training techniques generalize to unexpected domains. For example, if you do this RLHF safety training on preferences in English only, it's been observed many times that it at least partially generalizes to different languages.

So in language models, where behaviors are stored and how they are elicited is much more complicated than something like Word2Vec, where it is an explicit vector space where you can do this distance between related ideas.

We don't really know exactly how this complex reasoning behavior will transfer, but I suspect that we will learn things very quickly when we have model weights that do this, and you can look to see and elicit this behavior in other domains.

Traditional scaling is running out of steam

Timothy B. Lee: So when GPT-4 came out 18 months ago, I think the expectation a lot of people had was there's gonna be a GPT-5 in a couple years, and it'll be ten times larger and have ten times as much data, and we'll see another kind of leap in performance, where because you have a bigger model, it will generalize more things and be smarter.

And obviously, GPT-5 isn't out yet, and you have this o1 model that's in a different direction. Is your expectation that the scaling everybody was talking about 18 months ago is still likely to happen and that the o1-style post-training reinforcement learning thing is likely to happen in parallel? Or do you think it's more likely that the old scaling paradigm is going to run out of steam and the focus will all be on this new approach or some other approach—that we've gotten what we can get out of just throwing more data and having more parameters in these models.

Nathan Lambert: Scaling laws are a very specific thing, which is like a power law, which is a mathematical function shape between test loss, which is like word prediction accuracy, and the amount of compute that you put in. I'm pretty sure that loss is still going down, but the pain of a scaling law is that you have to put exponentially more compute in to get like a proportional reduction in this loss function.
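
The power-law shape Nathan is referring to looks roughly like the function below. The constants are made up purely to illustrate why each additional 10x of compute buys a smaller absolute drop in loss.

```python
# Illustrative power law relating test loss to training compute. The constants
# are invented for illustration, not fitted values from any real scaling study.
def test_loss(compute_flops: float, irreducible: float = 1.7,
              scale: float = 12.0, exponent: float = 0.05) -> float:
    return irreducible + scale * compute_flops ** -exponent

previous = None
for c in [1e21, 1e22, 1e23, 1e24]:
    loss = test_loss(c)
    drop = "" if previous is None else f"  (drop {previous - loss:.3f})"
    print(f"{c:.0e} FLOPs -> loss {loss:.3f}{drop}")
    previous = loss
# Each extra 10x of compute buys a smaller absolute reduction in loss.
```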

I generally think that this type of scaling is not something that most users of ChatGPT will see any difference from at this point. But at a technical level, I'm sure the loss is still going down. Landing this giant-model type of thing, in a product sense, is about who is going to get the right value out of it and in what domain.

So what I hear is that these companies have these big models and they largely use them to make these post-training techniques a lot better. They generate synthetic data. They do LLM-as-a-judge. They do this synthetic preference idea. They do all sorts of things to improve the models that people are using.

Claude 3.5 Sonnet is improving really fast. Gemini Flash. Even GPT-4o is improving really fast. And a lot of industry insiders attribute this to having bigger models, but it's a hard story to land for the average consumer, where scaling essentially means different things. And that's back to what Ilya meant: scaling needs to be used in much more tactical ways to improve things that people care about.

But at the end of the day, it's not like going from GPT-4 to GPT-5 will make the text that it generates shift from feeling AI-y to feeling like a masterpiece of American literature.

In a lot of ways, they've backed themselves into a corner on messaging. The answer to your question in some ways is yes, but in technical ways is no. And this is a hard time to be maneuvering a fast-growing AI company. There are a lot of market forces behind that, too.

Timothy B. Lee: Yeah, so just to make sure I understand what you're saying. You're saying that what you've heard is that right now, these companies have models that are larger than the frontier models that are available, and the plan is not to turn that into a new product. The plan is to use that as infrastructure that lets them do distillation and judging and various things to take the expertise of these very large models that aren't probably economical to use in the marketplace, but transfer the expertise they have to smaller models that then are the models that the users have access to.

Nathan Lambert: Yeah, I'm mostly saying that's what's been happening for the past few months. And I would guess that people are going to have access to these models eventually. But I think that they're getting much more value out of using the models as a tool than exposing them to users, which is not too surprising.

And I try not to make predictions about the future, because the only safe prediction in AI right now is that things will get much better and things that we don't expect will show up. o1 is the best example of things we don't expect showing up this year.

So I don't want to make any bets on us not getting access, but it is very safe to say people are using big models to make smaller models better right now. Scaling laws as people traditionally thought about them and how they were communicated—that famous Microsoft slide where there's the exponential and the whale—that type of nonsense is definitely dying out. But there are a lot of underlying things that are becoming more complicated while still delivering much better models to both customer and enterprise use cases.

Timothy B. Lee: So this is interesting to me. I've been wondering for a while, why isn't there a Claude 3.5 Opus? Why isn't there a Gemini 1.5 Ultra? And I've asked a number of people. You're the first person I've heard say that actually those models exist, basically, and they just haven't made them available.

And I assume the reason is just that if they're ten times bigger, it costs ten times as much to do the inference, and so nobody's going to want to pay that if you can get almost as good in a smaller version of the model. Is that right?

Nathan Lambert: Yeah. I think especially if you think about the average ChatGPT user, either they're talking about random things looking up historical facts or learning about some esoteric science concept. And it's not going to dramatically change how that is done.

Don’t expect an end to the data center boom

Timothy B. Lee: I know you said you don't want to do predictions, but there's this discourse in some parts of the AI universe about how we're going to have continued scaling and we're going to have trillion dollar data centers in five years so we can make GPT-7 that's ten trillion parameters or whatever.

Do you have any thoughts about if this is the trend? Do you think it's still going to be the case that we're going to make bigger and bigger models but they're maybe going to stay private for longer? Or are we a generation or two from the point where we've run out of Internet and you have diminishing returns and so just making the models bigger isn't that useful anymore?

Nathan Lambert: The thing is the data centers also are being used for training runs that are more expensive, not just in raw parameter count. So that's the thing with o1 is instead of one to five percent of compute being in post-training, it'll be something like 40 percent. And we scale all of these up in tandem and the largest limiting factor in my colleagues’ performance and like our model improvements is compute.

At AI2, we're not Google, we're not compute rich, but I am very sure that similar dynamics exist at all of these labs, where the biggest bottleneck to progress is people not having enough compute, especially when a large proportion of available compute is going to people actually using AI. I don't have the numbers for OpenAI, but I'm sure that double-digit percentages of their total compute are allocated to each of training and inference. And if they double the research budget, progress will go much faster. And we're on that hill where both more compute will increase the rate of improvement and more people are wanting to use these models. And at least in the near term, the build out feels extremely obvious to me.

Timothy B. Lee: That makes perfect sense. But one of the things people have said is that because of bandwidth constraints, if you want to train a really big model, you need a bunch of GPUs physically close to each other. This is why supposedly companies are renting out [nuclear power plants].

For those other uses, if you're doing o1-style training or inference, is it as important for that to be one big data center rather than a bunch of little ones? For inference, it's not, right? You can spread it out geographically.

Nathan Lambert: Yeah. It's not as important as for the large scale base model. There are definitely improvements to be had. Like, you have the model that's generating text in this post-training phase on one part of the data center, and then you have the model that's actually updating the weights and doing the gradient descent.

And one of the things you have to do is make these weights all equal. So you take a couple of learning steps and then you have to send the weights across your data center to make sure you have the latest version of your model when you're doing post-training, and that would take a lot longer if you have to send the data to a different data center across the country.

But that is not as bad as the traditional pretraining where you need to do this like many times a second. I think the pretraining is like how do you synchronize weights after—I don't know—every step or however many few steps it is. So it is definitely not as crucial, but I do think that there's value there. And I think a lot of it is just instead of having to do the hacks to use multiple data centers or cram your pretraining into one place there's still going to be value of having these bigger data centers.

I'm not the expert on pretraining and systems. I think you could get somebody on that knows a lot more about data centers than I do.

What’s happening inside o1?

Dean Ball: So I want to just go back to o1 itself and that whole paradigm. What's happening, right? You go to o1, you go to chatgpt.com and you pull up o1 if you pay for ChatGPT, and you ask it a question, and sometimes it shrugs its shoulder at you. It thinks about your question for five seconds, which means you didn't ask a very hard question.

But if you ask a hard question, it might think for 45 seconds, a minute, a minute and a half. o1 pro, which is a new model from OpenAI—not a new model, really, but a new tier of model, "pro mode," they call it. That's $200 a month. That will think for two or three minutes or more, sometimes up to five or six minutes, about certain very hard questions. When you're sitting there watching the model think, what is actually going on behind the scenes?

Nathan Lambert: We don't know the exact details on the pro mode at least, but for o1 the blog post says this, and there are examples. It's just generating a big stream of tokens where it's been taught to check its own work.

I think if you use a language model, you use ChatGPT, when you enter a query, tokens start appearing. It's called streaming, where the parts of the answer come in as they're generated. And OpenAI has decided to hide this streaming behavior for o1, either—if you are genuinely taking their word at face value—to make the user experience better, because the tokens it would be streaming would be very confusing. It'd be like: let me try that. Then let me try this. Oh, and that was wrong. That's not a great user experience.

But also, in a competitive landscape, o1 is one of the few things that OpenAI has that Google does not. Google historically releases all the same AI products that OpenAI does, from video to image to whatever, and detailed information on how the model behaves would be important to replicating it.

So the behavior of o1 on a mechanical level is that it is generating a lot of tokens and not showing you. And when it takes longer to answer, it is generating more tokens. They just decide not to show them to you. And then at some point, o1 will generate a special token, which is like an "I'm starting the actual answer" thing.

And then that's when you'll see the tokens appear as a user.
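
Mechanically, the serving behavior being described can be pictured as a filter over the token stream: everything before a special marker is hidden reasoning, everything after it is shown. The marker string and the token list below are hypothetical; OpenAI has not published the actual token.

```python
# Hypothetical sketch of hiding reasoning tokens until a "begin answer" marker.
ANSWER_MARKER = "<|final_answer|>"  # made-up token name for illustration

def visible_stream(token_stream):
    """Yield only the tokens that appear after the (hypothetical) answer marker."""
    answering = False
    for token in token_stream:
        if token == ANSWER_MARKER:
            answering = True
            continue
        if answering:
            yield token

hidden_plus_answer = ["Let", "me", "try", "x=3", "...", "wrong", ",", "retry",
                      ANSWER_MARKER, "The", "answer", "is", "4", "."]
print(" ".join(visible_stream(hidden_plus_answer)))  # The answer is 4 .
```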

Dean Ball: And so when it's generating the tokens that you don't see, it's talking to itself would be one way to think about it. Is it generating a bunch of different potential answers to the prompt and searching over them? Or is it just going, all right, how would I go about solving this problem? Here's one thing and here's another thing, and oh, I made a mistake there.

Is it more linear? Or is it more a search being executed in parallel over different potential answers to the question?

Nathan Lambert: Almost everyone at this point agrees that it's just doing one linear thing for normal o1. I think some of the hints on o1 pro are that people think it's doing what is technically called pass@n, which is that it does multiple completions in parallel and then compares the final answers to decide which one to give to you.

Generally tree search is extremely expensive to implement in terms of the rate of compute increase. So there are plenty of projects in the open and probably at these labs where they are doing something with trees to improve performance, but in terms of a cost to benefit in terms of serving the models, o1 is this model where you get the model to do this linear reasoning at training time and potentially at training time, you're doing tree search to get the exact right training data so that at test time, it's much easier to deploy. o1 is very likely not doing any tree search or anything like that.

It's just rambling. What is often seen as a broken language model is just, very long rambling. And o1 kind of harnesses that in a very reasonable way, which is a sophisticated type of control.
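
One common way to "compare the final answers" of parallel completions is a self-consistency-style majority vote, sketched below. Whether o1 pro does exactly this is speculation; the answer-extraction function is a stand-in.

```python
from collections import Counter

def majority_answer(completions: list[str], extract) -> str:
    """Combine parallel samples by extracting each one's final answer and
    returning the most common one (self-consistency style). `extract` is a
    stand-in parser supplied by the caller."""
    answers = [extract(c) for c in completions]
    return Counter(a for a in answers if a is not None).most_common(1)[0][0]

extract_last_word = lambda text: text.split()[-1] if text.split() else None
samples = ["... so the result is 42", "I get 41", "double-checking gives 42"]
print(majority_answer(samples, extract_last_word))  # 42
```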

Dean Ball: And you can see this too with the Chinese company DeepSeek, which has released an o1-like model called R1. Obviously we don't know if it's the same exact approach, same algorithm, et cetera, but they are showing you the hidden tokens, the ones that OpenAI hides; the chain of thought is shown to you. And it is bizarre. Sometimes it will slip into Mandarin, it'll start writing to itself. And then other times it goes back. I've heard that o1 does that too. o1 will go into other languages.

Also, sometimes it tweets like OpenAI employees, all lowercase. It'll write like they tweet, so all kinds of strange behaviors. The point that I think is worth emphasizing for listeners is this ability to correct mistakes and that being trained in.

o1 has superior epistemics

One finding that I thought was interesting is OpenAI also released a new benchmark called SimpleQA, which is really just a hallucination benchmark. It asks models factual questions that are just extremely obscure. o1 hallucinates sometimes on those. The hallucination rate is not, for me, the interesting thing; it is lower than other models, I think.

But the other thing that's interesting about that paper is that they found that o1 had a substantially higher rate of basically saying, I don't know the answer to that question. And that is interesting because that suggests to me that the model has better epistemics, so it has a sense of what it knows and what it does not know.

And that's been a problem with all kinds of language models over the—

Nathan Lambert: In some ways that's downstream of training to predict the next token, right? Because one token is just part of an answer, and it's hard to know if that part of the answer is right, versus training on a large set of problems that are verified to be true. So as we go into 2025, that seems like a great eval to use in our work: how do we blend from this first era of post-training, which is much more aligned with the language modeling loss, to this kind of new time where you expect to be training on whether things are right or wrong?

And the calibration, as you said, is a very hard thing to do.

Dean Ball: So am I right, as someone who is not machine learning background, am I right to be intrigued by this finding? Oh, that's eyebrow raising or do you have better research tastes than me? And is there something I'm just missing? I'm just, am I just being dumb about something?

Nathan Lambert: Oh, no, my reaction was literally like, Oh, I gotta go Slack people I work with and be like, “we should use this in our next training.” This is like where things are going.

A lot of things in AI just are breaking down as this post-training loss changes things. Scaling laws are based on this next token prediction loss and they're like plotting token prediction accuracy. It's the loss function versus compute.

And if we're using an RL loss, we are spending a lot of compute, but it is not focused on next token accuracy. The loss has an implicit regularization to not change the model too much. But it is trying to change the model to do things specifically rather than predict words. So if we spend a lot of compute on that, the idea of a scaling law is a very different thing.

Timothy B. Lee: If you're doing next token prediction, you get a positive score if you get exactly the same token, and you get a negative score if you get a different token. But if the model produces a different stream of tokens that gets to the same place: say you're doing a math problem, there are many different ways to write the sentence that the answer to your question is five. You don't really care that much about what those words are, you care whether the five is a five. And so the idea is that RL gives you a better signal: the five is the token you care about, and you don't care that much about the other tokens.

Nathan Lambert: Yeah. It's like the credit assignment is the idea. Within the reinforcement learning optimizer is something called a value function, which assigns a score to every one of these intermediate tokens.

It's very easy to see how that would end up learning the final answer token to be much more valuable than the intermediate "the answer is" part of it, given that the final answer is what is checked. So that is true. So much of machine learning is built on like stochastic gradient descent, which is averaging across a batch of samples and like a naive average is squashing all the learning you're doing to be seen as equivalent in priority.

Whereas if you have credit assignment that can learn which part of it is more important, the learning, the shape of learning is very different.

o1 could dramatically improve AI coding tools

Dean Ball: So I have one question for you about these models and coding, because it does seem like the o1 models are better at coding. But also, the advances in coding between GPT-4 and Claude 3.5 Sonnet, or certainly o1, are enormous, right? Like we haven't had a new generation of model, but the models are way better at coding than they were 18 months ago.

Obviously, like some part of that is because you do have a ground truth signal to some extent on code, right? Did the code execute or did it not? But that is not as sophisticated of a ground truth signal as it is for mathematics.

Like, there's a quality to the reasoning in code, right? There's efficient code, there's code with bugs in it, right?

Nathan Lambert: There's also unit tests. So it ends up being a much more complicated reward shaping problem. We haven't applied this in our work to code, mostly because the infrastructure and kind of design is harder and data is harder. But that's the next step: you do things like having code questions with unit tests, and you might have what is called a linter in a code development environment, which makes sure all the formatting for your language follows the specific rules that you specify.

There's debates in Python, like tabs versus spaces and stuff like that, but there's a lot more subtle changes. And you can bake all of these into your reward function. But I'm much less sure of what it will be learning. And I know that these frontier labs are doing this where they have unit tests, they have many different things that are grading across code. There are research papers talking about this too. It's just much more complicated to bake this into your general ChatGPT model while having everything at once.

And that's one of the frontiers in academic research. But we know that these closed labs are doing this where they do subtle reward design and formatting problems for code in this kind of reinforcement fine tuning domain.

So it's mostly just that we have the existence proof and we need to figure out how to do this so that it's transparent.
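
A toy version of the blended reward shaping described here for code: mostly "did the unit tests pass," plus a smaller style term. Real pipelines sandbox execution and use real linters; the exec-based checker and the weights below are purely illustrative.

```python
def code_reward(candidate_code: str, unit_tests: list[str], style_ok: bool,
                style_weight: float = 0.2) -> float:
    """Toy blended reward for a code completion: fraction of unit tests passed,
    plus a smaller formatting/linting term. exec() is only for illustration;
    real pipelines run candidate code in a sandbox."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)          # define the candidate solution
        passed = 0
        for test in unit_tests:
            try:
                exec(test, namespace)            # each test is an assert statement
                passed += 1
            except AssertionError:
                pass
        test_score = passed / len(unit_tests)
    except Exception:
        test_score = 0.0                         # code didn't even run
    return test_score + style_weight * float(style_ok)

candidate = "def add(a, b):\n    return a + b\n"
tests = ["assert add(2, 2) == 4", "assert add(-1, 1) == 0"]
print(code_reward(candidate, tests, style_ok=True))  # 1.2
```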

Dean Ball: Because it's just like a way higher dimensional thing compared to the yes, no, did the code compile or not? But also, there are some programming languages that are more susceptible to formal verification than others.

And so my question to you is Rust would be an example of such a thing, right? Do performance increases in a specific programming language transfer to other ones? If I get really good at Rust because I have a good way to make a reward signal with formal verification of Rust code, does that make me better at C++ or Python?

Nathan Lambert: I would guess that there's at least a correlation, but the magnitude of correlation is hard to predict. So if you only train on Python, you're going to get a lot of improvements on Python and you're going to also get some improvements on other things, largely because the model has seen a lot of good code and bad code.

And I'm guessing there's actual similarities that it could extract by a really focused training, but it's kind of this blurred line of how do you do the art of it?

Timothy B. Lee: All right let's leave it there. Nathan Lambert. Thanks so much for coming on and helping us understand post-training and the future of language models.

Nathan Lambert: Thanks for having me. Let's avoid the winter.

Dean Ball: Thank you, Nathan. Well, that was great. I'm a big fan of Nathan. Nathan is one of the best at explaining the in-the-weeds post-training stuff to non-specialists. But what'd you think, Tim?

Timothy B. Lee: It seems like the industry is in a period of transition. This seems like a case where I haven't quite kept up myself. But take this idea of traditional scaling, where you make the models 10 times bigger, you ship that, that's GPT-5, and people want to use that.

He didn't quite say this, but basically he was saying that's dead and there's going to be a new kind of model—the o1 kind of model—where there's just an entirely different way of training models.

So far o1 has been very successful at a narrow set of domains, math and coding basically. And the big question is, if we keep doing this, are models going to get better at—like you said—writing a legal brief, or doing analysis of a patient in a medical setting or something like that. So far I don't think we've seen much progress on those things. And the big question is, if you do the things that are easily verified, does that somehow generalize or do you find a way to apply this technique to other domains?

And yeah, it sounded like Nathan was at least open to that possibility. It's at best an open question. What did you hear?

Dean Ball: I thought it was very interesting to hear that from Nathan. And also just to talk about the economic realities of the bigger models, the fact that you can't release those to consumers for economic reasons. And also probably because if you're OpenAI, once you've released your biggest model, then it's not just you that can distill from it and make smaller performant models. Other people can do that too.

Whereas with o1, so much of the stuff that you would really want to train on are those hidden chains of thought that you can't access for training purposes. I was very intrigued. He's not within a frontier lab, but he understands the culture and the mindset at the highest levels of the AI community right now.

Timothy B. Lee: Nathan's a pretty buttoned-down guy; he needs to stay pretty close to what he can back up. But there's this kind of larger narrative that we're a few years from AGI and we just need to scale up the models to get to AGI.

Clearly, he doesn't think, and I don't think, just literally scaling up the size of the models is going to get there. And then the question is, can we do it with o1? And, I'm open to the possibility that there will be some generalization, where training on math and coding is going to get you some additional abilities.

But I guess this reinforces my prior that we're probably still pretty far from anything you could call AGI just from extending this post-training reinforcement learning kind of approach, and that probably some additional techniques [will be needed]. He didn't exactly say that, but it seems to me like there's not as clearly an obvious path as there might have been a year or two ago.

Dean Ball: I don't think anybody serious in the AI world (and I distinguish that from what people say on Twitter) thinks that it's just a matter of scaling what's already on the table. I think everybody believes that there are research breakthroughs that need to happen. I think the trillion dollar question here is really how hard are those breakthroughs?

If you listen to Noam Brown from OpenAI, who's the person who led the o1 project, he says, there's absolutely still breakthroughs that need to happen. He thinks with high confidence that those breakthroughs are all smaller than the ones that have already been achieved.

And so that's the confidence inside the lab, but at the end of the day, they're breakthroughs of an unknown type. And so we don't know how hard they are. And I think it's ultimately going to be an empirical question.
