41 Comments

The simplest task I've found that breaks every LLM is:

> Please multiply the following two numbers 45*58 using the "grade school" arithmetic algorithm, showing all steps including carries.

You can choose longer random numbers to increase the difficulty. I found that o1 could usually multiply five-digit numbers but not six. 4o could multiply two-digit or sometimes three-digit numbers. Given that the number of steps in long multiplication is quadratic in the number of digits, that's a pretty big improvement!
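
For reference, here is a minimal Python sketch of the grade-school algorithm the prompt asks for (the helper name is mine); the per-digit carry loop is exactly the work that grows quadratically with the number of digits:

```python
def grade_school_multiply(a: int, b: int) -> int:
    """Multiply a and b digit by digit, printing each partial product,
    the way the prompt asks the model to show its work."""
    a_digits = [int(d) for d in str(a)][::-1]  # least-significant first
    partials = []
    for i, bd in enumerate(int(d) for d in str(b)[::-1]):
        carry, row = 0, []
        for ad in a_digits:                    # one pass per digit of a
            carry, digit = divmod(ad * bd + carry, 10)
            row.append(digit)
        if carry:
            row.append(carry)
        partial = int("".join(map(str, row[::-1]))) * 10**i
        partials.append(partial)
        print(f"digit {bd} of {b}: partial product {partial}")
    total = sum(partials)
    print(f"sum of partials: {total}")
    return total

assert grade_school_multiply(45, 58) == 45 * 58  # 2610
```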

That's very interesting, especially the part about the number of CoT tokens increasing sublinearly. I guess by specifying the algorithm my prompt made the task harder, though more directly comparable to human ability. I only tried a few examples, but o1 got every 6-digit-by-6-digit multiplication I tried wrong, well below 84.6%.

I suspected o1 would struggle with spatial reasoning too, but found it was pretty good at solving NYT Strands puzzles, a highly visual kind of puzzle that also requires backtracking.

https://chatgpt.com/share/66ed85e1-1d38-8002-b298-4c738ec1ea54

Same with Sudoku: https://chatgpt.com/share/66e9c33d-7120-8002-abaf-32711a40d0e2

The claimed solution to the Sudoku problem at https://chatgpt.com/share/66e9c33d-7120-8002-abaf-32711a40d0e2 is not correct. The far right column and the bottom right block both have two 9's and no 3's.

2 6 4 |5 7 8 |3 1 9
3 7 5 |1 6 9 |4 2 8
8 1 9 |7 3 2 |6 5 4
------+------+------
5 2 1 |8 4 6 |7 3 9
7 4 6 |2 9 5 |1 8 5
9 3 8 |1 7 4 |5 6 2
------+------+------
6 8 2 |3 5 1 |9 4 7
4 5 7 |9 2 3 |8 9 1
1 9 3 |6 8 7 |2 5 6
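
A quick way to audit a claimed solution is to check that every row, column, and 3x3 block is a permutation of 1 through 9. A short Python sketch that flags every broken unit in the grid above:

```python
def sudoku_errors(grid):
    """Return the names of all rows, columns, and 3x3 blocks whose
    digits are not a permutation of 1..9."""
    units = (
        [(f"row {i+1}", grid[i]) for i in range(9)]
        + [(f"column {j+1}", [grid[i][j] for i in range(9)]) for j in range(9)]
        + [(f"block ({r+1},{c+1})",
            [grid[3*r + i][3*c + j] for i in range(3) for j in range(3)])
           for r in range(3) for c in range(3)]
    )
    return [name for name, unit in units if sorted(unit) != list(range(1, 10))]

claimed = [
    [2,6,4,5,7,8,3,1,9], [3,7,5,1,6,9,4,2,8], [8,1,9,7,3,2,6,5,4],
    [5,2,1,8,4,6,7,3,9], [7,4,6,2,9,5,1,8,5], [9,3,8,1,7,4,5,6,2],
    [6,8,2,3,5,1,9,4,7], [4,5,7,9,2,3,8,9,1], [1,9,3,6,8,7,2,5,6],
]
print(sudoku_errors(claimed))  # flags column 9, the bottom-right block, and more
```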

There are also other problems with the claimed solution. E.g., the middle column has no 1's. The upper and lower middle blocks have no 4's. The center block has two 4's. There may be others.

o1 is a very notable improvement. Other kinds of modeling will take care of other kinds of shortcomings. A physics simulator could help it get better at spatial intelligence, by giving it a playground where it could make and validate hypotheses.

Maybe! I'm a bit skeptical. I think it might take some architectural innovations to model other kinds of data. But we'll see.

Just putting an LLM and a simulator together won't be enough, of course.

A step forward could be for the LLM to generate code that builds up a virtual world in a simulator, reflecting the problem being solved. Then, if the agent can run simulations, observe the outcomes, and refine the simulation, that loop could give it ideas about what is going on, provided the agent can reason.

To add: AlphaGo excelled by doing self-play. For that, you need rules or a model against which to validate potential ideas. Maybe that could be extended to situations where, instead of the rules of Go, you have the properties of physics.

I'm still chuckling at the idea of a car filled with helium floating away.

Your headline had me expecting that you had discovered quite a bit more than minor improvements to GPT's ability to solve language puzzles. The fact that the latest release has become slightly better at playing language games falls squarely in the "that's neat, but what does this really get us?" category, sort of like the new podcasting capabilities of NotebookLM.

I know you avoid scare quotes around "reason" when it comes to LLMs, and I agree that semantic wrangling about our descriptive vocabulary doesn't get us anywhere. However, "doing fairly crude pattern-matching" and "quickly get bogged down" seem more like ordinary machine learning capabilities than an "extraordinary ability."

It's hard to calibrate stuff like this. If you'd told me three years ago that an AI system would be able to get a score of 10/15 on the AIME I would have thought that was incredible. We have not achieved human-level intelligence across the board, but we are near human-level on a significant slice of reasoning problems. I think that's extraordinary—both because these tools are important in their own right and because they are likely to be a step toward even more impressive systems in the future.

Entirely agree about the surprising and impressive leap in capabilities in 2022 and that, regardless of whether it is a step toward AGI, that leap will be the basis of continued improvements. This is why my favorite piece of yours so far this year was about developers of self-driving cars using the language capabilities of LLMs to solve complex traffic problems. My mind is reblown each time I think about that.

At the same time, the impressive results on AIME and MMLU have me reconsidering some of my assumptions about what cognitive tests measure. I get that all we can compare LLMs to is other LLMs or to humans, but that is a limited frame to evaluate their capabilities.

Yeah, we should be very skeptical that cognitive tests measure cognitive ability when taken by programs that have familiarized themselves with the internet. In 1990, 10-year-olds taking IQ tests demonstrated a range of levels of cognitive ability, but they had not familiarized themselves with even one word of the internet. So what does it mean when a program that is literally designed to familiarize itself with the internet does well on these same tests? Maybe not much.

> 20 six-sided dice cannot add up to 250

Your conversation with GPT-o1 asked about 40 die rolls, not 20. Same answer of course, but perhaps a slightly harder problem.
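
Either way, the impossibility falls out of the maximum possible total; a trivial check:

```python
# Maximum possible totals: even all sixes fall short of 250.
print(20 * 6)  # 120
print(40 * 6)  # 240
```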

Ah, I should have said "20 pairs of six-sided dice." Thanks for catching that.

Most humans would fail the spatial reasoning puzzles without paper as well. Your conclusion that LLMs can’t replace humans yet because of knowledge seems odd since that’s an area where LLMs excel. What you’ve shown is that they can’t replace top STEM talent because of deficits in reasoning / mechanical ability. LLMs are probably better at summarization tasks than the 80th percentile person. Almost all workers with decades of experience are hardly true experts. We have literature on the diminishing returns of skill acquisition. The majority of the population already retains very little from schooling.

The o1 breakthrough is not only the training process; it must also have a different architecture. Pure transformer-based LLMs always ‘think’ for the same amount of time, while o1 thinks harder on hard problems.

It's important that o1 can backtrack. Without backtracking, all outputs would be of the same (linear) computational complexity. But even with backtracking o1 made errors on Sudoku. (See my comments to Clyde Wright.)

Once o1 gets cheaper and faster, things will start getting interesting. A model trained specifically for CoT was always going to be pretty solid, but they’re really pushing the limits of LLM architecture from a cost/energy standpoint. It’s basically like running 10 GPT-4s.

Hi Tim, was the diagram of the chess board you provided the same one as shown in the article? If so, it is missing the pawns on the e-file, and therefore Qe7 comes with check and is pretty much the same strength as taking, according to the chess.com engine.

Is it good at anything else besides math and basic logic?

This is a really good article by the way.

I would say it's on par with GPT-4o on topics that don't require in-depth reasoning.

Oh ok good to know!

LLMs usually run on very powerful computers, which of course have arithmetic at their core(s). Why doesn’t the LLM simply subcontract all calculations “downstairs”?

GPT-4o has the ability to write and execute Python scripts, so it will sometimes do that: recognize that something is a math problem, write a script to do the calculation, and then return the output of the script. But the goal is to build a system that can do a much broader range of math problems than just arithmetic, and this approach doesn't necessarily generalize to those other problems.
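
As a toy illustration of the "subcontract downstairs" idea (a hypothetical sketch, not OpenAI's actual tool-use machinery): parse the expression and compute it exactly, rather than having the model predict the digits token by token.

```python
import ast
import operator

# Hypothetical sketch of routing arithmetic "downstairs": parse the
# expression and evaluate it exactly with the CPU's integer arithmetic.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv,
       ast.Pow: operator.pow}

def calc(node):
    """Recursively evaluate a parsed integer-arithmetic expression."""
    if isinstance(node, ast.Expression):
        return calc(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](calc(node.left), calc(node.right))
    raise ValueError("not plain integer arithmetic")

print(calc(ast.parse("123456 * 654321", mode="eval")))  # exact, every time
```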

A human biologist runs their mind on a brain, which (among other things) includes lots of proteins that fold. But if you ask a human biologist how a particular common protein folds, they can’t just subcontract this “downstairs” - instead they have to think using whatever brain systems have learned to think explicitly about these sorts of questions.

The two limitations are actually similar in many ways. In both the human and the LLM, the higher level system represents information in very different ways from how the same information is represented at the lower level. If you could figure out a way to translate the lower level information so that it is available to the higher level system, and vice versa, you could enable a lot of interesting interactions.

A very minor version of this might be skilled meditators who learn how to increase or decrease their heart rate (which of course doesn’t get anywhere near as low a level as protein folding; it’s unclear whether this sort of introspective access would even be possible).

Great comment!

It appears that enabling models to verbalize their internal thoughts has enhanced their reasoning abilities in language tasks. Similarly, to develop stronger spatial intelligence, models might need the capability to internally simulate spatial scenarios—much like the videos generated by AI systems like Sora—to help them explore spatial possibilities.

Here's a problem all humans should be able to solve, but it will be interesting to see whether any so-called AI can even BEGIN scratching the surface of it:

I want you to get three black balls, three white balls and three buckets. Put a bucket in front of you, one to the left of it and one to the right of it. Put the black balls in the left bucket, and the white balls in the center bucket. Now I want you to play a role-playing game. You will be either Jack, George or Chad. If you're Jack, you will take a ball from a particular bucket (assuming there is one) and look at it. I want you to remember which bucket you last reached into. If it was the left bucket, you will reach into the center bucket. If it was the center bucket, you will reach into the right bucket, and if it was the right bucket, you will reach into the left bucket. If you didn't reach into a bucket before, reach into the left bucket. If, when you look into your hand, the ball you're holding is black, I want you to become George. If it's white, I want you to become Chad. And if your hand is empty, play again as Jack. If you are playing as George, put the ball into the right bucket, then become Jack again. If you're playing as Chad, put the ball into the center bucket, then become Jack. Count the number of times you've become Chad. When you become Chad for the third time, stop the play. Now become Jack. :) After you end the play, which balls are in which buckets?

Yes, recursive functions. You can do it; can an "AI" do it?
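
For what it's worth, the game is mechanical enough to simulate directly. A Python sketch under one reading of the rules (the prompt is ambiguous about whether Chad deposits his ball before the game stops on his third appearance; this assumes he does):

```python
# Direct simulation of the bucket game. Balls are 'B' (black) or
# 'W' (white); Jack cycles left -> center -> right -> left.
buckets = {"left": ["B", "B", "B"],
           "center": ["W", "W", "W"],
           "right": []}
next_reach = {None: "left", "left": "center",
              "center": "right", "right": "left"}

last, chad_count = None, 0
while True:
    last = next_reach[last]            # Jack reaches into the next bucket
    if not buckets[last]:              # empty hand: play again as Jack
        continue
    ball = buckets[last].pop()
    if ball == "B":                    # become George: ball goes right
        buckets["right"].append(ball)
    else:                              # become Chad: ball goes center
        chad_count += 1
        buckets["center"].append(ball)
        if chad_count == 3:            # third Chad ends the play
            break

print(buckets)
# {'left': [], 'center': ['W', 'W', 'W'], 'right': ['B', 'B', 'B']}
```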

The point about selective attention / forgetting is particularly interesting. Humans get bogged down as well when we can't figure out which aspects of a larger problem to focus on, but luckily we're still better at focus than LLMs. I wonder how the AI companies will tackle that. Current approaches seem to be mostly brute force (try a bunch of things, see which ones work, and discard failing attempts or backtrack), but I imagine you'll need to build in a better ability to decide ahead of time which approaches to a problem are worth trying.

It was quite easy to beat 4o on Reverse Tic-Tac-Toe. Without access to o1, I haven't been able to try it there. See: https://russabbott.substack.com/p/reverse-tic-tac-toe-with-chatgpt
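
For anyone who wants to pit a model against a perfect opponent: reverse (misère) tic-tac-toe, where completing three in a row loses, is small enough to solve exhaustively. A minimal minimax sketch:

```python
# Reverse (misere) tic-tac-toe: completing three in a row LOSES.
LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6),
         (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def completes_line(board, mark):
    """True if `mark` has three in a row (and thus has just lost)."""
    return any(all(board[i] == mark for i in line) for line in LINES)

def best_score(board, mark):
    """Value for `mark` to move: +1 forced win, 0 draw, -1 forced loss."""
    moves = [i for i, c in enumerate(board) if c == "."]
    if not moves:
        return 0
    other = "O" if mark == "X" else "X"
    scores = []
    for i in moves:
        board[i] = mark
        score = -1 if completes_line(board, mark) else -best_score(board, other)
        board[i] = "."
        scores.append(score)
    return max(scores)

print(best_score(list("." * 9), "X"))  # 0: a draw with perfect play
```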
