Really valuable history, thanks so much.
Regarding the "lesson to draw", I think you have laid it out beautifully. All that remains is to give it its name: **combination**. Per Brian Arthur[1] (and Christopher Alexander, and ...), a new technology is a new combination of existing technologies (which are, recursively, once-new combinations of even older technologies).
As you explain here, it took decades for the separate components to be individually assembled—only then could the final combination take place and produce the enormous leap in usefulness.
Corollary: the conventional wisdom was *right*. The components of AlexNet and its descendants *didn't* work. It was only the emergent combination that did.
[1] https://sites.santafe.edu/~wbarthur/thenatureoftechnology.htm
That's really interesting, I wasn't familiar with Arthur's book!
Similar points were made by Christopher Alexander (of _A Pattern Language_ fame, which inspired what software types call "design patterns") in his brilliant/weird monograph _Notes on the Synthesis of Form_.
It occurred to me after commenting that your history of neural networks here looks a bit like a technology evolving / being augmented-and-replaced while keeping a single name through several significant structural shifts. I wrote about how the history of "computers" followed that path:
1. Our first general purpose computers were roughly translations of the ideal Turing machine into concrete and finite mechanisms—and so, deserved the name. They actually *compute*.
2. Then Hopper et alia wrapped that machine ("von Neumann" or whatever you want to call the '46 generation) in a second, qualitatively different machine: the compiler-and-code mechanism of IBM and the seven dwarves *while calling it by the same name*, "computer". That's a misnomer: the computer per se inside them (roughly, the processor) powers their execution of programs. If they have a computer, they can't both be a computer and be so different from the computer they have.
3. ... more evolution through the '60s until Engelbart et alia wrapped Hopper's machine in a "GUI" (Bush's '45 memex, evolved) ... and called that third machine (properly: oN-Line System) a "computer", when it would be more accurately called a computer-powered set of versatile information machines.
4. Today's laptops are highly evolved descendants of NLS and we still call them "computers". I'd prefer "memex".
More here: http://whatarecomputersfor.net
Great piece, Timothy! I can say the deep learning boom caught ME by surprise...
You see, I spent my 30+ years in AI working in areas related to symbolic AI, like knowledge representation, automated reasoning, intelligent agents, etc. I never thought that those "subsymbolic" (notice the contempt) neural networks were capable of anything beyond character recognition.
And here we are today, discussing when the new AI systems will get to the AGI level (not in this decade, I'd say).
"So the [AI] boom of the last [12] years was made possible by [three] visionaries who pursued unorthodox ideas in the face of widespread criticism"
This is the crux of the matter, and it surely applies generically to almost every advancement / innovation / discovery that has ever taken place throughout time. You could replace the bracketed bits with virtually anything.
The pertinent question seems to be: if we continue to (try to) outsource our thinking to the machines / AI, as appears to be the wont of the current trajectory of LLMs etc., do we reach a position where we have become devoid of visionaries and advancement ends?
And/or is it likely that AI / machines can ever become the visionaries instead? Unconventional wisdom depends on the human capacity to 'trust their gut'... something that machines fundamentally can't (currently) do...
Today’s ‘too far’ is tomorrow’s foundation.
Fantastic article - thanks for sharing, great to see all the history together in one story!
ImageNet isolated a problem of obvious practical interest (classifying images) that existing approaches did badly at and turned it into a benchmark. Along the lines of your musings in the final section, I wonder what is the "next ImageNet" that will validate a new approach over the current paradigm. So far, LLMs are great at conquering benchmarks, and the exceptions (such as ARC-AGI) are of debatable practical relevance.
LLMs doing so well says more about weaknesses in the benchmarks than about LLMs being great.
ARC-AGI tried to test spatial reasoning and visual pattern understanding in a toy context. That is of considerable practical relevance.
LLMs do not understand physics, fine-grained manipulation, time-varying processes, or compositional reasoning. Stuff we do all the time.
I am not saying LLMs are a dead end; in fact they are an amazing success story. They are the first general-purpose but shallow machines we have come up with that can learn purely from lots of examples without getting lost.
But an AI will use an LLM only as a way of generating guesses. Without precise modeling of individual circumstances, the way we do natively, it can't resolve fine-level phenomena.
"ARC-AGI tried to test in a toy context spatial reasoning and visual pattern understanding. It is of a lot of practical relevance."
A technique that can classify ImageNet has immediate practical value through applications such as iNaturalist (for recognizing e.g. plant species).
With ARC-AGI, there might be a solution that is specific to that benchmark and has no wider applicability. The creators don't dispute that possibility, though they hope that the solution method will generalize.
"LLM does not understand physics, fine-grained manipulation, time-variable processes, compositional reasoning. Stuff we do all the time."
So where are the benchmarks on which LLMs perform consistently badly, and where a new approach that did better would be of immediate practical value? I'm not claiming that such benchmarks can't exist, just that I'm not aware of any.
(The closest I can think of are benchmarks such as SWE-bench that focus on practical usefulness, but LLM-based approaches haven't hit a wall on such benchmarks yet.)
There have been papers showing that LLMs do not do so well on reasoning. Here is an example: https://news.mit.edu/2024/reasoning-skills-large-language-models-often-overestimated-0711
ARC-AGI's benchmark is artificial, but the ability to discern patterns in noisy and vague situations is important in the real world, and a neural net can't figure those out unless trained with very similar data.
The state of benchmarks is not great; that's part of the problem. I think as current benchmarks saturate we'll need to produce more intricate ones, including for step-by-step reasoning.
I think we are on the right track, especially with systems such as o1, which go beyond the LLM's instant-generation paradigm and incorporate iterative problem solving.
It will likely work out as with AlphaGo, where neural nets were used for fast approximate matching to narrow down the search space, but then an honest expert (the tree search) was called in to carefully work out the details.
Simply scaling up neural nets is not enough. There is immense complexity at the fine level that even a huge neural net cannot capture.
That's why the focus on language models is so important. It allows one to go from messy specifics to a high-level overview. That comes at the cost of precision, but this way it is easier to see where in the problem space one is, and to find good candidate strategies, given other similarly posed problems.
Then an agent would diligently try the generated strategies, frequently checking how it is doing. Sometimes that may fail and one has to restart with a different lead.
In short, AI agents will work the way people do: use experience, guesswork, inspections, and external tools, rather than training for everything from the outset.
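To make that loop concrete, here is a minimal sketch in Python of the propose / try / inspect / restart pattern described above. It only illustrates the control flow; `propose_strategies`, `execute_step`, and `looks_promising` are hypothetical stand-ins for an LLM call, an external tool, and an inspection check, not any particular library's API.

```python
# Minimal sketch of the propose / try / inspect / restart loop described above.
# propose_strategies, execute_step, and looks_promising are hypothetical stand-ins
# for an LLM call, an external tool, and an inspection check.
def solve(problem, propose_strategies, execute_step, looks_promising, max_leads=5):
    for strategy in propose_strategies(problem, n=max_leads):  # cheap guesses from the LLM
        state = problem
        for step in strategy:
            state = execute_step(state, step)      # use an external tool / simulator
            if not looks_promising(state):         # check often how it is doing
                break                              # this lead failed; try the next one
        else:
            return state                           # every step checked out: done
    return None                                    # all leads exhausted
```

The neural net supplies cheap candidate leads, and the external checks decide which lead is worth pursuing further: the same division of labour as in the AlphaGo analogy above.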
Your comment reminds me of the reasoning models, starting with o1 from OpenAI, which shifted some of the compute from training time to inference time.
We sometimes use a "baking a cake" analogy to explain how AI works: The ingredients are the data, the recipe is the algorithm, and the compute is the oven. So you chose a lovely trio.
This is a reminder that great ideas often seem absurd at first—until they revolutionize everything.
Unfortunate to perpetuate certain technical myths. For instance, what Nvidia calls "cores" are not cores; they are SIMD lanes. And Jensen invented neither CUDA nor the GPU, nor was CUDA even the first.
Why publish stuff you had to know is wrong? Isn't it unsatisfying to simplify the narrative so far that it's untrue?
Good write-up. Interesting to see how the different puzzle pieces were all needed and fit together to start the deep learning revolution, which resulted in today's LLMs.
Amazing read!