A DeepMind AI rivals the world's smartest high schoolers at geometry
It's another DeepMind effort to combine language models and search.
A system developed by Google’s DeepMind has set a new record for AI performance on geometry problems. DeepMind’s AlphaGeometry managed to solve 25 of the 30 geometry problems drawn from the International Mathematical Olympiad between 2000 and 2022.
That puts the software ahead of the vast majority of young mathematicians, and just shy of IMO gold medalists. DeepMind estimates that the average gold medalist would have solved 26 out of 30 problems. Many view the IMO as the world’s most prestigious math competition for high school students.
“Because language models excel at identifying general patterns and relationships in data, they can quickly predict potentially useful constructs, but often lack the ability to reason rigorously or explain their decisions,” DeepMind writes. To overcome this difficulty, DeepMind paired a language model with a more traditional symbolic deduction engine that performs algebraic and geometric reasoning.
The research was led by Trieu Trinh, a computer scientist who recently earned his PhD from New York University. He was a resident at DeepMind between 2021 and 2023.
Evan Chen, a former Olympiad gold medalist who evaluated some of AlphaGeometry’s output, praised it as “impressive because it's both verifiable and clean.” Whereas some earlier software generated complex geometry proofs that were hard for human reviewers to understand, the output of AlphaGeometry is similar to what a human mathematician would write.
AlphaGeometry is part of DeepMind’s larger project to improve the reasoning capabilities of large language models by combining them with traditional search algorithms. DeepMind has published several papers in this area over the last year.
How AlphaGeometry works
Let’s start with a simple example shown in the AlphaGeometry paper, which was published by Nature on Wednesday:
The goal is to prove that if a triangle has two equal sides (AB and AC), then the angles opposite those sides will also be equal. We can do this by creating a new point D at the midpoint of the third side of the triangle (BC). It’s easy to show that each side of triangle ABD has the same length as the corresponding side of triangle ACD. And two triangles whose corresponding sides are all equal are congruent, which means their corresponding angles are equal too.
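Written out as a traditional statement-and-reason proof, the argument looks something like the sketch below. This is just the standard side-side-side argument spelled out for readers who want the steps; it is not AlphaGeometry’s exact output format.

```latex
\documentclass{article}
\begin{document}
\begin{enumerate}
  \item $AB = AC$ \hfill (given)
  \item $D$ is the midpoint of $BC$, so $BD = DC$ \hfill (construction)
  \item $AD = AD$ \hfill (shared side)
  \item $\triangle ABD \cong \triangle ACD$ \hfill (side-side-side, from 1--3)
  \item $\angle ABC = \angle ACB$ \hfill (corresponding angles of congruent triangles)
\end{enumerate}
\end{document}
```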
Geometry problems from the IMO are much more complex than this toy problem, but fundamentally they have the same structure. They all start with a geometric figure and some facts about the figure like “side AB is the same length as side AC.” The goal is to generate a sequence of valid inferences that conclude with a given statement like “angle ABC is equal to angle BCA.”
For many years, we’ve had software that can generate valid conclusions from a set of starting assumptions. Simple geometry problems can be solved by “brute force”: mechanically listing every possible fact that can be inferred from the given assumptions, then listing every possible inference from those facts, and so on until you reach the desired conclusion.
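Here is a minimal sketch of that brute-force idea. The fact and rule representations are made up for illustration; they are not how AlphaGeometry’s engine actually encodes geometry.

```python
# Minimal sketch of brute-force forward chaining over a set of facts.
# Facts are tuples and rules are functions, purely for illustration.

def forward_chain(facts, rules, goal, max_rounds=10):
    """Repeatedly apply every rule to every known fact until the goal appears
    or nothing new can be derived."""
    known = set(facts)
    for _ in range(max_rounds):
        new_facts = set()
        for rule in rules:
            # Each rule maps the current fact set to zero or more derived facts.
            new_facts |= set(rule(known)) - known
        if goal in new_facts:
            return True
        if not new_facts:          # nothing new to derive: give up
            return False
        known |= new_facts
    return False

# Toy rule: equality of segments is symmetric.
def symmetry(facts):
    return {("eq", b, a) for (kind, a, b) in facts if kind == "eq"}

print(forward_chain({("eq", "AB", "AC")}, [symmetry], ("eq", "AC", "AB")))  # True
```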
But this kind of brute-force search isn’t feasible for an IMO-level geometry problem because the search space is too large. Not only do harder problems require longer proofs, but sophisticated proofs often require the introduction of new elements to the initial figure—as with point D in the above proof. Once you allow for these kinds of “auxiliary points,” the space of possible proofs explodes and brute-force methods become impractical.
So mathematicians need to develop an intuition about which proof steps are likely to lead to a successful result. DeepMind’s breakthrough was to use a language model to provide the same kind of intuitive guidance to an automated search process.
The downside to a language model is that it is not great at deductive reasoning—language models can sometimes “hallucinate” and reach conclusions that don’t actually follow from the given premises. So the DeepMind team developed a hybrid architecture. There’s a symbolic deduction engine that mechanically derives conclusions that logically follow from the given premises. But periodically control will pass to a language model that will take a more “creative” step like adding a new point to the figure.
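The control flow described above might look roughly like the sketch below. The names `deduce` and `propose_construction`, and the one-suggestion-at-a-time loop, are my own simplifications rather than DeepMind’s API; the real system explores many model suggestions in parallel rather than taking a single one.

```python
# Sketch of the alternating loop: exhaustive symbolic deduction, then a
# "creative" language-model step whenever the deduction engine gets stuck.

def solve(premises, goal, deduce, propose_construction, max_constructions=8):
    """Alternate symbolic deduction with model-suggested auxiliary constructions."""
    known = set(premises)
    for _ in range(max_constructions + 1):
        known |= deduce(known)               # symbolic engine: everything that follows
        if goal in known:
            return known                     # enough facts to trace back a proof
        suggestion = propose_construction(known, goal)
        if suggestion is None:               # the model has no further ideas
            return None
        known |= suggestion                  # add the new point's defining facts
    return None
```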
What makes this tricky is that it takes a lot of data to train a new language model, and there aren’t nearly enough examples of difficult geometry problems. So instead of relying on human-designed geometry problems, Trinh and his DeepMind colleagues generated a huge database of challenging geometry problems from scratch.
To do this, the software would generate a series of random geometric figures like those illustrated above. Each had a set of starting assumptions. The symbolic deduction engine would generate a list of facts that follow logically from the starting assumptions, then more claims that follow from those deductions, and so forth. Once there was a long enough list, the software would pick one of the conclusions and “work backwards” to find the minimum set of logical steps required to reach the conclusion. This list of inferences is a proof of the conclusion, and so it can become a problem in the training set.
Sometimes a proof would reference a point in the figure, but the proof didn’t depend on any initial assumptions about that point. In those cases, the software could remove that point from the problem statement but then introduce the point as part of the proof. In other words, it could treat this point as an “auxiliary point” that needed to be introduced to complete the proof. These examples helped the language model to learn when and how it was helpful to add new points to complete a proof.
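One way to sketch the pipeline from the last two paragraphs is shown below. The function `deduce_once` and the `parents` bookkeeping are illustrative placeholders of my own, not anything from DeepMind’s implementation.

```python
# Sketch of synthetic-data generation: deduce forward from random premises while
# recording each fact's parents, then walk backwards from a chosen conclusion
# to keep only the premises and steps it actually needs.

import random

def generate_training_example(premises, deduce_once, rounds=5):
    parents = {p: [] for p in premises}          # fact -> facts it was derived from
    for _ in range(rounds):
        for fact, used in deduce_once(set(parents)):
            parents.setdefault(fact, list(used))
    derived = [f for f in parents if parents[f]]
    if not derived:
        return None
    conclusion = random.choice(derived)

    # Traceback: collect only the facts the conclusion actually depends on.
    needed, stack = set(), [conclusion]
    while stack:
        fact = stack.pop()
        if fact not in needed:
            needed.add(fact)
            stack.extend(parents[fact])

    minimal_premises = [p for p in premises if p in needed]
    proof_steps = [f for f in derived if f in needed]
    # Any point that appears in the proof but not in minimal_premises would be
    # treated as an "auxiliary point" the model must learn to construct.
    return minimal_premises, proof_steps, conclusion
```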
In total, DeepMind generated 100 million synthetic geometry proofs, including almost 10 million that required introducing “auxiliary points” as part of the solution. During the training process, DeepMind placed extra emphasis on examples involving auxiliary points to encourage the model to take these more creative steps when solving real problems.
This is the DeepMind playbook
If you’ve been following DeepMind’s work over the years, a lot of this should sound familiar. One of DeepMind’s most famous results was AlphaZero, an AI system that taught itself to play Go and chess by playing against itself. The game of Go is so complex that it’s not practical to do a brute-force search of possible move sequences, so DeepMind built a neural network to give its AI an “intuition” about which moves were most promising. That network was trained on synthetic data generated by AlphaZero’s self-play games.
More recently, DeepMind has been experimenting with ways to combine language models with AlphaZero-style tree search. In a widely cited paper last May, researchers at Princeton and Google DeepMind introduced the idea of “Tree of Thoughts” language models. Instead of just generating a single chain of reasoning, a Tree of Thoughts model systematically explores multiple reasoning chains that “branch off” in different directions.
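In code, the idea looks roughly like the beam-style search below. Here `propose_steps`, `rate`, and `is_solution` stand in for language-model calls; this is a simplified sketch, not the paper’s exact algorithm.

```python
# Rough sketch of tree-of-thoughts-style search: propose several candidate next
# steps at each point, score them, and keep only the most promising chains.

def tree_of_thoughts(problem, propose_steps, rate, is_solution, depth=4, beam=3):
    frontier = [[]]                          # each element is a chain of reasoning steps
    for _ in range(depth):
        candidates = [chain + [step]
                      for chain in frontier
                      for step in propose_steps(problem, chain)]
        for chain in candidates:
            if is_solution(problem, chain):
                return chain
        # Keep the `beam` highest-rated chains instead of committing to one.
        frontier = sorted(candidates, key=lambda c: rate(problem, c), reverse=True)[:beam]
        if not frontier:
            return None
    return None
```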
Last month, DeepMind announced FunSearch, which used a language model to generate computer programs to solve difficult math problems like the online bin-packing problem. FunSearch used a genetic algorithm rather than a tree search to explore the space of possible programs. But at a more abstract level, it used the same basic approach as AlphaGeometry: use a language model as a source of “creativity,” then use a more traditional deterministic program to measure the quality of each proposed solution.
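Stripped to its essentials, that shared propose-and-verify loop might look like the toy sketch below. The names `llm_mutate` and `score` are placeholders of my own, and the real FunSearch uses a more elaborate island-based evolutionary scheme than this single loop.

```python
# Schematic of the "language model proposes, deterministic evaluator scores"
# loop common to FunSearch and AlphaGeometry.

import random

def evolve(seed_programs, llm_mutate, score, generations=100, max_pool=20):
    pool = list(seed_programs)
    for _ in range(generations):
        parent = max(random.sample(pool, k=min(3, len(pool))), key=score)
        child = llm_mutate(parent)               # creative step: propose a variation
        if score(child) > score(parent):         # deterministic check: keep improvements
            pool.append(child)
        if len(pool) > max_pool:
            pool.remove(min(pool, key=score))    # drop the weakest candidate
    return max(pool, key=score)
```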
DeepMind co-founder and chief scientist Shane Legg explained this basic approach in an October podcast interview.
“These foundation models are world models of a kind, and to do really creative problem-solving, you need to start searching,” Legg said. “To get real creativity you need to search through spaces of possibilities and find these hidden gems.”
Tree of Thoughts, FunSearch, and now AlphaGeometry are all variants on this basic theme. Language models are a good source of promising ideas, but they aren’t great at logical reasoning. Sometimes they make mistakes or get confused. So DeepMind has been experimenting with embedding language models in larger systems that can engage in more systematic reasoning.
And there are rumors that OpenAI is working on similar approaches. Last November, in the wake of Sam Altman’s firing and re-hiring as CEO of OpenAI, there were reports that OpenAI was building a mysterious project called Q*. As I wrote at the time, it’s likely that this was an attempt to combine language models with more traditional search algorithms.
I have no doubt that this approach will yield some important breakthroughs. But I’m not convinced—as some people are—that this will be an important step toward truly general intelligence. What FunSearch and AlphaGeometry have in common is that both are limited to domains where we have automated ways of evaluating proposed solutions. And this means that language models don’t need to understand the problems they’re solving in any depth—they just need to generate a lot of plausible-sounding solutions and hand them off to another algorithm that checks whether they’re any good.
But this kind of crutch isn’t available for most of the intellectual problems humanity would like to solve. Making progress on most hard problems requires mastering a lot of messy facts about the world and gaining deep conceptual insights about how the world works. It’s not obvious how one would extend the techniques in projects like FunSearch and AlphaGeometry to tackle these more difficult problems.
A few other useful notes about this work:
1. Even though pure search methods are intractable for this problem, the best computer-algebra-based solution gets about 10 of the IMO problems right. The paper basically says the authors don’t pursue this approach because it isn’t relevant to the broader AI questions they’re interested in. (I found that disappointing.)
2. The paper also shows that the deductive model they built gets 21 problems right on its own, which is already well past the previous state of the art and close to gold-medalist level, even without the deep-learning component.