Why the deep learning boom caught almost everyone by surprise
"You’ve taken this idea way too far," a mentor told Prof. Fei-Fei Li.
During my first semester as a computer science graduate student at Princeton, I took COS 402: Artificial Intelligence. Toward the end of the semester there was a lecture about neural networks. This was in the fall of 2008, and I got the distinct impression—both from that lecture and the textbook—that neural networks had become a backwater.
Neural networks had delivered some impressive results in the late 1980s and early 1990s. But then progress stalled. By 2008, many researchers had moved on to mathematically elegant approaches such as support vector machines.
I didn’t know it at the time, but a team at Princeton—in the same computer science building where I was attending lectures—was working on a project that would upend the conventional wisdom and demonstrate the power of neural networks. That team, led by Prof. Fei-Fei Li, wasn’t working on a better version of neural networks. They were hardly thinking about neural networks at all.
Rather, they were creating a new image dataset that would be far larger than any that had come before: 14 million images, each labeled with one of nearly 22,000 categories.
Li tells the story of ImageNet in her recent memoir, The Worlds I See. As she worked on the project, she faced a lot of skepticism from friends and colleagues.
“I think you’ve taken this idea way too far,” a mentor told her a few months into the project in 2007. “The trick is to grow with your field. Not to leap so far ahead of it.”
It wasn’t just that building such a large dataset was a massive logistical challenge. People doubted the machine learning algorithms of the day would benefit from such a vast collection of images.
“Pre-ImageNet, people did not believe in data,” Li said in a September interview at the Computer History Museum. “Everyone was working on completely different paradigms in AI with a tiny bit of data.”
Ignoring negative feedback, Li pursued the project for more than two years. It strained her research budget and the patience of her graduate students. When she took a new job at Stanford in 2009, she took several of those students—and the ImageNet project—with her to California.
ImageNet received little attention for the first couple of years after its release in 2009. But in 2012, a team from the University of Toronto trained a neural network on the ImageNet dataset, achieving unprecedented performance in image recognition. That groundbreaking AI model, dubbed AlexNet after lead author Alex Krizhevsky, kicked off the deep learning boom that has continued until the present day.
AlexNet would not have succeeded without the ImageNet dataset. AlexNet also would not have been possible without a platform called CUDA that allowed Nvidia’s graphics processing units (GPUs) to be used in non-graphics applications. Many people were skeptical when Nvidia announced CUDA in 2006.
So the AI boom of the last 12 years was made possible by three visionaries who pursued unorthodox ideas in the face of widespread criticism. One was Geoffrey Hinton, a University of Toronto computer scientist who spent decades promoting neural networks despite near-universal skepticism. The second was Jensen Huang, the CEO of Nvidia, who recognized early that GPUs could be useful for more than just graphics.
The third was Fei-Fei Li. She created an image dataset that seemed ludicrously large to most of her colleagues. But it turned out to be essential for demonstrating the potential of neural networks trained on GPUs.
Geoffrey Hinton
A neural network is a network of thousands, millions, or even billions of neurons. Each neuron is a mathematical function that produces an output based on a weighted sum of its inputs.
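To make that concrete, here is a minimal Python sketch of a single neuron: it multiplies each input by a weight, adds up the results, and passes the total through a nonlinearity. The sigmoid activation, the bias term, and the example numbers are illustrative assumptions, not details from any particular network.

```python
import math

def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, squashed into (0, 1) by a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# Hypothetical example: three inputs with hand-picked weights.
output = neuron([0.5, 0.1, 0.9], [0.4, -0.6, 0.2])
```

A full network chains many of these together, feeding the outputs of one layer of neurons into the next.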
Suppose you want to create a network that can identify handwritten decimal digits, such as a handwritten number two. Such a network would take in an intensity value for each pixel in an image and output a probability distribution over the ten possible digits: 0, 1, 2, and so forth.
To train such a network, you first initialize it with random weights. Then you run it on a sequence of example images. For each image, you train the network by strengthening the connections that push the network toward the right answer (in this case, a high probability value for the “2” output) and weakening connections that push toward a wrong answer (a low probability for “2” and high probabilities for other digits). If trained on enough example images, the model should start to predict a high probability for “2” when shown a two—and not otherwise.
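The training loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions: a single layer of ten output neurons, a softmax to turn scores into probabilities, a made-up 4x4 "image," and a learning rate picked for illustration. Real digit recognizers use far more pixels, examples, and layers.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, pixels):
    """One score per digit: a weighted sum of the pixel intensities."""
    scores = [sum(w * p for w, p in zip(row, pixels)) for row in weights]
    return softmax(scores)

def train_step(weights, pixels, label, lr=0.5):
    """Nudge weights toward the correct digit and away from the others."""
    probs = predict(weights, pixels)
    for digit, row in enumerate(weights):
        target = 1.0 if digit == label else 0.0
        error = probs[digit] - target  # positive means "too confident"
        for i, p in enumerate(pixels):
            row[i] -= lr * error * p

random.seed(0)
num_pixels, num_digits = 16, 10
# Start from small random weights, as the text describes.
weights = [[random.uniform(-0.1, 0.1) for _ in range(num_pixels)]
           for _ in range(num_digits)]

# Hypothetical 4x4 "image" standing in for a handwritten two.
image = [0, 1, 1, 0,
         0, 0, 1, 0,
         0, 1, 0, 0,
         0, 1, 1, 1]
before = predict(weights, image)[2]
for _ in range(20):
    train_step(weights, image, label=2)
after = predict(weights, image)[2]
```

After the loop, `after` is larger than `before`: the probability assigned to "2" has risen because the connections pushing toward that answer were strengthened and the rest were weakened.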
In the late 1950s, scientists started to experiment with basic networks that had a single layer of neurons. However, their initial enthusiasm cooled as they realized that such simple networks lacked the expressive power required for complex computations.
Deeper networks—those with multiple layers—had the potential to be more versatile. But in the 1960s, no one knew how to train them efficiently. This was because changing a parameter somewhere in the middle of a multi-layer network could have complex and unpredictable effects on the output.
So by the time Hinton began his career in the 1970s, neural networks had fallen out of favor. Hinton wanted to study them, but he struggled to find an academic home to do so. Between 1976 and 1986, Hinton spent time at four different research institutions: Sussex University, the University of California San Diego (UCSD), a branch of the UK Medical Research Council, and finally Carnegie Mellon, where he became a professor in 1982.
In a landmark 1986 paper, Hinton teamed up with two of his former colleagues at UCSD, David Rumelhart and Ronald Williams, to describe a technique called backpropagation for efficiently training deep neural networks.
Their idea was to start with the final layer of the network and work backwards. For each connection in the final layer, the algorithm computes a gradient—a mathematical estimate of whether increasing the strength of that connection would push the network toward the right answer. Based on these gradients, the algorithm adjusts each parameter in the model’s final layer.
The algorithm then propagates these gradients backwards to the second-to-last layer. A key innovation here is a formula—based on the chain rule from high school calculus—for computing the gradients in one layer based on gradients in the following layer. Using these new gradients, the algorithm updates each parameter in the second-to-last layer of the model. Then the gradients get propagated backwards to the third-to-last layer and the whole process repeats once again.
The algorithm only makes small changes to the model in each round of training. But as the process is repeated over thousands, millions, billions, or even trillions of training examples, the model gradually becomes more accurate.
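Here is a minimal Python sketch of that backwards pass for a tiny network with one hidden layer. The network size, the sigmoid activations, the squared-error loss, the learning rate, and the logical-OR toy task are all assumptions chosen for illustration. This is not code from the 1986 paper.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(net, x):
    """Return hidden activations and the final output for input x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w1"], net["b1"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"])
    return hidden, out

def backprop_step(net, x, target, lr=0.5):
    """One small update: final-layer gradients first, then the layer before."""
    hidden, out = forward(net, x)
    # Gradient of the squared error 0.5 * (out - target)**2 with respect to
    # the output neuron's weighted sum.
    delta_out = (out - target) * out * (1.0 - out)
    # Chain rule: each hidden neuron's gradient is computed from delta_out,
    # using the final-layer weights as they were before this update.
    delta_hidden = [delta_out * w * h * (1.0 - h)
                    for w, h in zip(net["w2"], hidden)]
    # Update the final layer.
    for j, h in enumerate(hidden):
        net["w2"][j] -= lr * delta_out * h
    net["b2"] -= lr * delta_out
    # Update the earlier layer using the propagated gradients.
    for j, dh in enumerate(delta_hidden):
        for i, xi in enumerate(x):
            net["w1"][j][i] -= lr * dh * xi
        net["b1"][j] -= lr * dh

def total_loss(net, data):
    return sum(0.5 * (forward(net, x)[1] - t) ** 2 for x, t in data)

random.seed(1)
net = {
    "w1": [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)],
    "b1": [0.0] * 3,
    "w2": [random.uniform(-1, 1) for _ in range(3)],
    "b2": 0.0,
}
# Toy task (an assumption for illustration): the logical OR of two inputs.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
loss_before = total_loss(net, data)
for _ in range(2000):
    for x, t in data:
        backprop_step(net, x, t)
loss_after = total_loss(net, data)
```

Each call to `backprop_step` changes the weights only slightly, but repeating it over the training examples drives `loss_after` below `loss_before`. With more layers, the same chain-rule step is applied once per layer, moving from the output back toward the input.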
Hinton and his colleagues weren’t the first to discover the basic idea of backpropagation. But their paper popularized the method. As people realized it was now possible to train deeper networks, it triggered a new wave of enthusiasm for neural networks.
Hinton moved to the University of Toronto in 1987 and began attracting young researchers who wanted to study neural networks. One of the first was the French computer scientist Yann LeCun, who did a year-long postdoc with Hinton before moving to Bell Labs in 1988.
Hinton’s backpropagation algorithm allowed LeCun to train models deep enough to perform well on real-world tasks like handwriting recognition. By the mid-1990s, LeCun’s technology was working so well that banks started to use it for processing checks.
“At one point, LeCun’s creation read more than 10 percent of all checks deposited in the United States,” wrote Cade Metz in his 2021 book Genius Makers.
But when LeCun and other researchers tried to apply neural networks to larger and more complex images, it didn’t go well. Neural networks once again fell out of fashion, and some researchers who had focused on neural networks moved on to other projects.
Hinton never stopped believing that neural networks could outperform other machine learning methods. But it would be many years before he’d have access to enough data and computing power to prove his case.
Jensen Huang
The brain of every personal computer is a central processing unit (CPU). These chips are designed to perform calculations in order, one step at a time. This works fine for conventional software like Windows and Office. But some video games require so many calculations that they strain the capabilities of CPUs. This is especially true of games like Quake, Call of Duty, and Grand Theft Auto that render three-dimensional worlds many times per second.
So gamers rely on GPUs to accelerate performance. Inside a GPU are many execution units—essentially tiny CPUs—packaged together on a single chip. During gameplay, different execution units draw different areas of the screen. This parallelism enables better image quality and higher frame rates than would be possible with a CPU alone.
Nvidia invented the GPU in 1999 and has dominated the market ever since. By the mid-2000s, Nvidia CEO Jensen Huang suspected that the massive computing power inside a GPU would be useful for applications beyond gaming. He hoped scientists could use it for compute-intensive tasks like weather simulation or oil exploration.
So in 2006, Nvidia announced the CUDA platform. CUDA allows programmers to write “kernels,” short programs designed to run on a single execution unit. Kernels allow a big computing task to be split up into bite-sized chunks that can be processed in parallel. This allows certain kinds of calculations to be completed far faster than with a CPU alone.
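The kernel idea can be mimicked on an ordinary CPU. The Python sketch below is not CUDA code and calls no Nvidia API: `scale_add_kernel` and `launch` are hypothetical names, and a thread pool stands in for a GPU's execution units. It only illustrates the structure, in which a small function handles one element of the job and a launcher runs that function once per element.

```python
from concurrent.futures import ThreadPoolExecutor

def scale_add_kernel(index, a, x, y, out):
    """Handle one element of the job: out[i] = a * x[i] + y[i]."""
    out[index] = a * x[index] + y[index]

def launch(kernel, n, *args, workers=8):
    """Run the kernel once per index, spreading the calls across workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every call and re-raises any exception.
        list(pool.map(lambda i: kernel(i, *args), range(n)))

x = [float(i) for i in range(1000)]
y = [1.0] * 1000
out = [0.0] * 1000
launch(scale_add_kernel, len(x), 2.0, x, y, out)
```

Because no call depends on the result of any other, the calls can run in any order or all at once. Python threads will not actually make this faster. On a GPU, the same independence lets hundreds of execution units work at the same moment.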
But there was little interest in CUDA when it was first introduced, wrote Stephen Witt in the New Yorker last year:
When CUDA was released, in late 2006, Wall Street reacted with dismay. Huang was bringing supercomputing to the masses, but the masses had shown no indication that they wanted such a thing.
“They were spending a fortune on this new chip architecture,” Ben Gilbert, the co-host of “Acquired,” a popular Silicon Valley podcast, said. “They were spending many billions targeting an obscure corner of academic and scientific computing, which was not a large market at the time—certainly less than the billions they were pouring in.”
Huang argued that the simple existence of CUDA would enlarge the supercomputing sector. This view was not widely held, and by the end of 2008 Nvidia’s stock price had declined by seventy per cent…
Downloads of CUDA hit a peak in 2009, then declined for three years. Board members worried that Nvidia’s depressed stock price would make it a target for corporate raiders.
Huang wasn’t specifically thinking about AI or neural networks when he created the CUDA platform. But it turned out that Hinton’s backpropagation algorithm could easily be split up into bite-sized chunks. And so training neural networks turned out to be a killer app for CUDA.
According to Witt, Hinton was quick to recognize the potential of CUDA:
In 2009, Hinton’s research group used Nvidia’s CUDA platform to train a neural network to recognize human speech. He was surprised by the quality of the results, which he presented at a conference later that year. He then reached out to Nvidia. “I sent an e-mail saying, ‘Look, I just told a thousand machine-learning researchers they should go and buy Nvidia cards. Can you send me a free one?’ ” Hinton told me. “They said no.”
Despite the snub, Hinton and his graduate students, Alex Krizhevsky and Ilya Sutskever, obtained a pair of Nvidia GTX 580 GPUs for the AlexNet project. Each GPU had 512 execution units, allowing Krizhevsky and Sutskever to train a neural network hundreds of times faster than would be possible with a CPU. This speed allowed them to train a larger model—and to train it on many more training images. And they would need all that extra computing power to tackle the massive ImageNet dataset.
Fei-Fei Li
Fei-Fei Li wasn’t thinking about either neural networks or GPUs as she began a new job as a computer science professor at Princeton in January of 2007. While earning her PhD at Caltech, she had built a dataset called Caltech 101 that had 9,000 images across 101 categories.
That experience had taught her that computer vision algorithms tended to perform better with larger and more diverse training datasets. Not only had Li found that her own algorithms performed better when trained on Caltech 101, but other researchers had started training their models using Li’s dataset and comparing their performance to one another. This turned Caltech 101 into a benchmark for the field of computer vision.
So when she got to Princeton, Li decided to go much bigger. She became obsessed with an estimate by vision scientist Irving Biederman that the average person recognizes roughly 30,000 different kinds of objects. Li started to wonder if it would be possible to build a truly comprehensive image dataset—one that included every kind of object people commonly encounter in the physical world.
A Princeton colleague told Li about WordNet, a massive database that attempted to catalog and organize 140,000 words. Li called her new dataset ImageNet, and she used WordNet as a starting point for choosing categories. She eliminated verbs and adjectives as well as intangible nouns like “truth.” That left a list of 22,000 countable objects, ranging from ambulance to zucchini.
She planned to take the same approach she’d taken with the Caltech 101 dataset: use Google’s image search to find candidate images, then have a human being verify them. For the Caltech 101 dataset, Li had done this herself over the course of a few months. This time she would need more help. She planned to hire dozens of Princeton undergraduates to help her choose and label images.
But even after heavily optimizing the labeling process—for example, pre-downloading candidate images so they’re instantly available for students to review—Li and her graduate student, Jia Deng, calculated it would take more than 18 years to select and label millions of images.
The project was saved when Li learned about Amazon Mechanical Turk, a crowdsourcing platform Amazon had launched a couple of years earlier. Not only was AMT’s international workforce more affordable than Princeton undergraduates, but the platform was also far more flexible and scalable. Li’s team could hire as many people as they needed, on demand, and pay them only as long as they had work available.
AMT cut the time needed to complete ImageNet from 18 years to two. Li writes that her lab spent two years “on the knife-edge of our finances” as they struggled to complete the ImageNet project. But they had enough funds to pay three people to look at each of the 14 million images in the final dataset.
ImageNet was ready for publication in 2009, and Li submitted it to the Conference on Computer Vision and Pattern Recognition, which was held in Miami that year. Their paper was accepted, but it didn’t get the kind of recognition Li hoped for.
“ImageNet was relegated to a poster session,” Li writes. “This meant that we wouldn’t be presenting our work in a lecture hall to an audience at a predetermined time, but would instead be given space on the conference floor to prop up a large-format print summarizing the project in hopes that passersby might stop and ask questions… After so many years of effort, this just felt anticlimactic.”
To generate public interest, Li turned ImageNet into a competition. Realizing that the full dataset might be too unwieldy to distribute to dozens of contestants, she created a much smaller (but still massive) dataset with 1,000 categories and 1.4 million images.
The first year’s competition in 2010 generated a healthy amount of interest, with 11 teams participating. The winning entry was based on support vector machines. Unfortunately, Li writes, it was “only a slight improvement over cutting-edge work found elsewhere in our field.”
The second year of the ImageNet competition attracted fewer entries than the first. The winning entry in 2011 was another support vector machine, and it just barely improved on the performance of the 2010 winner. Li started to wonder if the critics had been right. Maybe “ImageNet was too much for most algorithms to handle.”
“For two years running, well-worn algorithms had exhibited only incremental gains in capabilities, while true progress seemed all but absent,” Li writes. “If ImageNet was a bet, it was time to start wondering if we’d lost.”
But when Li reluctantly staged the competition a third time in 2012, the results were totally different. Geoff Hinton’s team was the first to submit a model based on a deep neural network. And its top-5 accuracy was 85 percent—10 percentage points better than the 2011 winner.
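Top-5 accuracy counts a prediction as correct if the true label appears among the model's five highest-scoring guesses. Here is a minimal Python sketch of the metric, using made-up scores over six hypothetical categories rather than real ImageNet data:

```python
def top_k_accuracy(score_rows, labels, k=5):
    """Fraction of examples whose true label is among the k top scores."""
    hits = 0
    for scores, label in zip(score_rows, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c],
                        reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Made-up scores for two images over six categories (indices 0-5).
scores = [
    [0.05, 0.30, 0.10, 0.25, 0.20, 0.10],  # true label 4: ranked third
    [0.40, 0.20, 0.15, 0.12, 0.08, 0.05],  # true label 5: ranked last
]
labels = [4, 5]
accuracy = top_k_accuracy(scores, labels, k=5)
```

Here the first image counts as a hit and the second as a miss, so `accuracy` is 0.5. With 1,000 categories in the competition, landing the right label in the top five is a much harder test than it is in this toy.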
Li’s initial reaction was incredulity: “Most of us saw the neural network as a dusty artifact encased in glass and protected by velvet ropes.”
“This is proof”
The ImageNet winners were scheduled to be announced at the European Conference on Computer Vision in Florence, Italy. Li, who had a baby at home in California, was planning to skip the event. But when she saw how well AlexNet had done on her dataset, she realized this moment would be too important to miss: “I settled reluctantly on a twenty-hour slog of sleep deprivation and cramped elbow room.”
On an October day in Florence, Alex Krizhevsky presented his results to a standing-room-only crowd of computer vision researchers. Fei-Fei Li was in the audience. So was Yann LeCun.
Cade Metz reports that after the presentation, LeCun stood up and called AlexNet “an unequivocal turning point in the history of computer vision. This is proof.”
The success of AlexNet vindicated Hinton’s faith in neural networks, but it was arguably an even bigger vindication for LeCun.
AlexNet was a convolutional neural network, a type of neural network that LeCun had developed 20 years earlier to recognize handwritten digits on checks. (For more details on how CNNs work, see the in-depth explainer I wrote for Ars Technica in 2018.) Indeed, there were few architectural differences between AlexNet and LeCun’s image recognition networks from the 1990s.
AlexNet was simply far larger. In a 1998 paper, LeCun described a document recognition network with seven layers and 60,000 trainable parameters. AlexNet had eight layers, but these layers had 60 million trainable parameters.
LeCun could not have trained a model that large in the early 1990s because there were no computer chips with as much processing power as a 2012-era GPU. Even if LeCun had managed to build a big enough supercomputer, he would not have had enough images to train it properly. Collecting those images would have been hugely expensive in the years before Google and Amazon Mechanical Turk.
And this is why Fei-Fei Li’s work on ImageNet was so consequential. She didn’t invent convolutional networks or figure out how to make them run efficiently on GPUs. But she provided the training data that large neural networks needed to reach their full potential.
The technology world immediately recognized the importance of AlexNet. Hinton and his students formed a shell company with the goal of being “acquihired” by a big tech company. Within months, Google purchased the company for $44 million. Hinton worked at Google for the next decade while retaining his academic post in Toronto. Ilya Sutskever spent a few years at Google before becoming a cofounder of OpenAI.
AlexNet also made Nvidia GPUs the industry standard for training neural networks. In 2012, the market valued Nvidia at less than $10 billion. Today, Nvidia is one of the most valuable companies in the world, with a market capitalization north of $3 trillion. That high valuation is driven mainly by overwhelming demand for GPUs like the H100 that are optimized for training neural networks.
Sometimes the conventional wisdom is wrong
“That moment was pretty symbolic to the world of AI because three fundamental elements of modern AI converged for the first time,” Li said in a September interview at the Computer History Museum. “The first element was neural networks. The second element was big data, using ImageNet. And the third element was GPU computing.”
Today leading AI labs believe the key to progress in AI is to train huge models on vast datasets. Big technology companies are in such a hurry to build the data centers required to train larger models that they’ve started to lease entire nuclear power plants to provide the necessary power.
You can view this as a straightforward application of the lessons of AlexNet. But I wonder if we ought to draw the opposite lesson from AlexNet: that it’s a mistake to become too wedded to conventional wisdom.
“Scaling laws” have had a remarkable run in the 12 years since AlexNet, and perhaps we’ll see another generation or two of impressive results as the leading labs scale up their foundation models even more.
But we should be careful not to let the lessons of AlexNet harden into dogma. I think there’s at least a chance that scaling laws will run out of steam in the next few years. And if that happens, we’re going to need a new generation of stubborn nonconformists to notice that the old approach isn’t working and try something different.
Really valuable history, thanks so much.
Regarding the "lesson to draw", I think you have laid it out beautifully. All that remains is to give it its name: **combination**. Per Brian Arthur[1] (and Christopher Alexander, and ...), a new technology is a new combination of existing technologies (which are, recursively, once-new combinations of even older technologies).
As you explain here, it took decades for the separate components to be individually assembled—only then could the final combination take place and produce the enormous leap in usefulness.
Corollary: the conventional wisdom was *right*. The components of AlexNet and its descendants *didn't* work. It was only the emergent combination that did.
[1] https://sites.santafe.edu/~wbarthur/thenatureoftechnology.htm
"So the [AI] boom of the last [12] years was made possible by [three] visionaries who pursued unorthodox ideas in the face of widespread criticism"
This is the crux of the matter and surely generically applies to most all advancements / innovation / discovery that have ever taken place throughout time. You could replace the bracketed bits with virtually anything.
The pertinent question(s) seem to be: "If we continue to (try and) outsource our thinking to the machines / AI as appears to be the wont of the current trajectory of LLMs etc., do we reach a position where we have become devoid of the visionaries and advancement ends?"
And/or is it likely that AI / machines can ever become the visionaries instead? Unconventional wisdom depends on the human capacity to 'trust their guts'... something that machines fundamentally can't (currently) do...