OpenAI made waves over the last few days because of leaked-then-confirmed news of a $200 per month GPT pro mode. A few folks have taken this as (further) confirmation that the scaling hypothesis is dead, AGI is overhyped, there's no threat to humans/no need to worry about AI safety, and, of course, these things were never that big a deal to begin with. Surely the increasing focus on commercialization is indicative?
I think some of these people may have an axe to grind, and are maybe looking for confirmation where they can find it. Still, more than one person has asked me in recent days about the "AI stagnation". And I've seen Erik Hoel's thinkpiece floating around a few places too. The debate about whether AGI can be achieved with deep neural networks is an interesting and profoundly important one in the year 2024, and it's worth digging into1.
The Economics of LLMs
I think that the LLM business and technical model looks a lot like the search business and technical model.
Globally, there are about 4 search engines that matter — Google, Bing, Baidu, and Yandex. But really, that's not true: Google is the only one that really matters. Bing was, for a long time, a Microsoft vanity project2. And Baidu and Yandex only exist because of national security interests — the CCP really doesn't want its citizens to (checks notes) talk about Winnie the Pooh.
There are a few niche search engines too. Westlaw and LexisNexis and JSTOR. But — with apologies to any lawyers — these also don't matter. The total market cap of Google is approximately a gazillion times that of these three combined.
This shouldn't be a surprise to anyone; everyone knows Google is big and search is big. But there are some ways in which this market outcome is a bit odd.
For one thing, building a search engine is surprisingly easy. A lot of people learn how to do it as part of a distributed systems course in undergrad. And tools like OpenSearch and Elasticsearch provide comprehensive, state-of-the-art distributed search tooling, such that anyone who wants to spin up their own search index can do so pretty quickly. So why aren't there more scrappy search startups?
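To give a sense of how low the barrier is, here's a minimal sketch using the elasticsearch Python client against a local node (the index name, documents, and query are made up for illustration, and it assumes elasticsearch-py 8.x); a real engine adds crawling, ranking, and a mountain of infrastructure on top:

```python
from elasticsearch import Elasticsearch

# Assumes a local Elasticsearch node; swap in your own cluster URL.
es = Elasticsearch("http://localhost:9200")

# Index a couple of toy "web pages".
es.index(index="pages", document={"url": "https://example.com/a", "text": "how to build a search engine"})
es.index(index="pages", document={"url": "https://example.com/b", "text": "winnie the pooh fan page"})
es.indices.refresh(index="pages")

# Full-text query, ranked by relevance out of the box.
resp = es.search(index="pages", query={"match": {"text": "search engine"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["url"])
```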
The big issue is that building a search index of the entire Internet is really capital intensive. You need servers. You need a lot of data storage. You need data centers all over the world. And that's before you get into the brand recognition, the complex ecosystem of related tools and services, the massive amounts of money Google has spent on making their engine the default of everything. The startup cost to compete against Google is massive.
Normally, when an upstart company wants to take a bite out of a big established market, they go low on price. This is especially true in a setting where the underlying product isn't sticky. All things being equal, consumers will go for the best product along some cost/value curve. Some consumers may pay more for premium services; others less for cheaper services. That segmentation allows for competition among many smaller players.
But search is free. There is no price competition, because in the status quo consumers aren't paying anything3. So the curve collapses into a line: the only thing that matters is which search engine is the best one.
This leads to massive economies of scale and winner-take-all market dynamics. Google is the most popular search engine because it's the biggest and best one, which gives it all the money to continue being the biggest and best one. If a competitor managed to become bigger and better, it may flip the market4, but good luck doing that out of nowhere.
Going back to LLMs, I think you see roughly the same market dynamics. LLMs are pretty easy to make, and lots of people know how to do it — you learn how in any CS program worth a damn. But there are massive economies of scale (GPUs, data access) that make it hard for newcomers to compete, and using an LLM is effectively free5 so consumers have no stickiness and will always go for the best option. You may eventually see one or two niche LLM providers, like our LexisNexis above. But for the average person these don't matter at all; the big money is in becoming the LLM layer of the Internet.
And, as expected, there's only a handful of LLMs that matter. There's a Chinese one and a European one, both backed by state interests. And there's Google, Anthropic, and OpenAI6 all jockeying for the critical "foundation layer" spot.
The economics of LLMs means that it is critical for these players to have the best models. There's no room for second place.
Scaling Hypothesis
Much like a search index, LLMs become exponentially more valuable the bigger they are. The difference between GPT2 and GPT3 is substantial — the former was a neat research toy, the latter a fundamental shift for millions of people.
How much does scale matter? Let's look at some examples.
GPT2 has 1.5 billion parameters and was trained on ~15 billion tokens7 using 256 GPUs.
GPT3 has 175 billion parameters and was trained on ~500 billion tokens using 1,024 GPUs (~100x bigger).
GPT4 has 1.7 trillion parameters and was trained on 13 trillion tokens using 25,000 GPUs (~10x bigger).
Hopefully you're seeing a pattern. Very roughly, LLM benchmark performance follows a scaling law that's a function of compute and data — exponential improvements in the number of flops (measured as the size of the model or number of training steps) and in the number of "tokens" lead to linear improvements in model performance. This pattern has been tracked across many different model sizes, across nearly 7 orders of magnitude. And, anecdotally, linear improvements in model benchmark performance result in step-function improvements in real-world model ability. Again, GPT2 to GPT3 was a meaningful shift. Yes, GPT3 is strictly better at generating text. But it's also picked up a bunch of other things, like being able to play chess or do basic math.
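To make the "exponential in, linear out" shape concrete, here is a toy sketch in the spirit of the published power-law fits (the constants below are illustrative placeholders, not the actual fitted values): loss falls by a roughly constant factor for every 10x of compute, which plots as a straight line on log-log axes.

```python
def toy_loss(compute, c_ref=3e8, alpha=0.05):
    """Toy power-law scaling curve: loss = (c_ref / compute) ** alpha.

    c_ref and alpha are illustrative placeholders in the spirit of the
    published fits, not the actual fitted constants.
    """
    return (c_ref / compute) ** alpha

# Each additional 10x of compute shrinks the toy loss by the same ~11% factor,
# i.e. performance improves steadily even as compute grows exponentially.
for exponent in range(1, 8):  # span ~7 orders of magnitude
    compute = 10.0 ** exponent
    print(f"compute = 1e{exponent}  ->  toy loss = {toy_loss(compute):.3f}")
```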
Very roughly, we ought to expect GPT5 to be trained on 250,000 GPUs and 130 trillion "tokens"8. So it turns out we have a very good answer as to why we haven't seen GPT5 yet.
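That figure is just the back-of-the-envelope continuation of the ~10x-per-generation pattern above, using this post's rough estimates rather than any official numbers:

```python
# This post's rough estimates for GPT4, not official figures.
gpt4_gpus = 25_000
gpt4_tokens = 13e12

# Continue the rough order-of-magnitude jump between generations.
scale = 10
gpt5_gpus = gpt4_gpus * scale       # ~250,000 GPUs
gpt5_tokens = gpt4_tokens * scale   # ~130 trillion tokens

print(f"GPT5 (extrapolated): ~{gpt5_gpus:,} GPUs, ~{gpt5_tokens / 1e12:.0f} trillion tokens")
```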
Everyone is compute capped
The history of the scaling hypothesis has been:
models hit some kind of bottleneck to scale;
a bunch of people come out of the woodwork to claim that this bottleneck proves the scaling hypothesis is dead;
the bottleneck is overcome (resulting in a generation of better models);
the cycle repeats.
Up until recently, those bottlenecks were mostly limitations of AI as a field. We needed to figure out regularization (MDL), and GPU parallelization (AlexNet), and a flop-efficient architecture (Transformers), and training at scale (GPipe). But it wasn't entirely just AI research — we also needed to figure out how to get a lot of data and a lot of compute power, and those advances mostly happened independently of deep learning, through the Internet and the cloud computing revolution. In fact, I think a not-quite-correct but useful history of deep learning is that the (first?) AI winter happened precisely when the scaling hypothesis was blocked by external factors that hadn't caught up. It was just that at the time, everyone (except Hinton) assumed the methods were the problem.
Anyway, we are currently in the middle of the cycle — the latest generation of models are hitting roadblocks. But the bottleneck is decidedly not an AI one. In 2023 NVIDIA shipped roughly 3.75 million data center GPU units to the US. Assuming our scaling laws hold, GPT5 requires ~7% of all high end GPUs in the country9. If you continue down this line of reasoning, eventually the pipeline becomes energy capped as well.
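The ~7% figure is just division, and running the same arithmetic one generation further shows why this gets uncomfortable fast (all inputs are the rough estimates used in this post, not official shipment or cluster figures):

```python
# Rough estimates from this post, not official shipment or cluster figures.
nvidia_dc_gpus_2023 = 3_750_000   # data center GPUs shipped in 2023
gpt5_gpus = 250_000               # extrapolated above
gpt6_gpus = gpt5_gpus * 10        # one more ~10x generation

print(f"GPT5 share of 2023 shipments: {gpt5_gpus / nvidia_dc_gpus_2023:.1%}")   # ~6.7%
print(f"GPT6 share of 2023 shipments: {gpt6_gpus / nvidia_dc_gpus_2023:.1%}")   # ~66.7%
```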
All of the companies mentioned above are jockeying for as many chips as they can get their hands on, in some cases pre-committing to massive shipments, which in turn prevents the other companies from developing a commanding lead. It's a bit of a race to the bottom; the real winner is the single company that everyone is buying chips from. No wonder NVIDIA's stock price is where it is.
The state of AI
But is AI stagnating?
There is a strict sense in which consumer AI may not feel like it's growing at the same rate as it did from 2020 to 2023. That period was a particularly magical time when we had a surplus of chips that models hadn't yet caught up to. Like a gas expanding to fill its container, our chip utilization has since caught up, so releases may not come at such a rapid clip.
But in a deeper sense, the scaling laws still feel ironclad. OpenAI, Anthropic, and Meta are all investing in massive GPU superclusters. Google is too, though primarily relying on their internal TPU chipsets. Google and Microsoft are simultaneously investing in cheaper energy sources — it shouldn't be lost on anyone that both have bought access to nuclear energy — while OpenAI plans on creating its own chip designs. If you follow the money, it seems pretty clear that there's widespread belief that there's more juice left to squeeze from scale.
And much like Moore's law, there are a lot of ways in which scale can be achieved outside of raw chip quantity. Chips get better. Energy gets cheaper. We discover a new, even more efficient model architecture. We develop ways to train simultaneously on a mixed collection of GPUs, CPUs, and ASICs. Etc. People are doing research on all of these things, too.
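One way to see why raw chip count is only one lever: effective training compute is a product of several independent terms, and improving any one of them multiplies through. A toy decomposition, with made-up numbers purely for illustration:

```python
def effective_flops(num_chips, flops_per_chip, utilization, seconds):
    """Total useful training FLOPs as a product of independent levers."""
    return num_chips * flops_per_chip * utilization * seconds

baseline = effective_flops(
    num_chips=25_000,
    flops_per_chip=300e12,   # ~300 TFLOP/s class accelerator (illustrative)
    utilization=0.35,        # fraction of peak actually achieved
    seconds=90 * 24 * 3600,  # ~90 days of training
)

# Better chips and a better software stack move the same product
# without shipping a single extra GPU.
improved = effective_flops(
    num_chips=25_000,
    flops_per_chip=1e15,     # next-generation accelerator (illustrative)
    utilization=0.5,         # better parallelism / kernels
    seconds=90 * 24 * 3600,
)

print(f"~{improved / baseline:.1f}x more compute from the same chip count")
```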
Some of the problem here is that consumers are just getting impatient. The first version of GPT3 was published in May 2020. GPT4 was launched in March 2023. That's 34 months. It's only been ~20 months since GPT4 was released, so there's a bit more time to go before OpenAI starts 'falling behind schedule'. Until recently we haven't even had the capability to build large enough GPU clusters. And it is also plausible that the release of stronger LLMs tracks closer to self-driving cars than to iPhones. The hype cycle for self-driving cars was at its peak around 2014-2015. Even though the technology wasn't quite consumer-ready by then, the estimated 'release date' was still within only a few short years. In 2024 there are readily available self-driving cars in several cities. From a research perspective, the folks saying that self-driving cars would be ready within a few years of 2014 were more right than those saying they would never be ready at all.
So I don't think AI is stagnating in the technical sense. I think there's a meaningful difference between "this is impossible" and "this is possible given what we know about the world, but hard to do". The former is research; the latter is engineering. And we know how to do engineering. When folks like Ilya or Andreessen talk about plateaus, I think they mean that we've run out of what we can do with our current set of chips. More chips, on the other hand, remain an exciting frontier.
As for the people who are arguing that AI is obviously dead and the whole field was doomed to failure because it's "just statistics" or "just linear algebra", idk, this feels a lot like shifting the goalposts. Standard LLMs are exposed to way less data than the average human baby; the fact that they can do anything at all is a miracle, and the fact that they can regularly pass competence tests like the SAT or the Bar should be endlessly awe-inspiring. For some reason people keep wondering when we'll have AGI, even though it's literally here and accessible through a web browser10. In any case, the cope isn't going to stop the AI from taking everyone's jobs (mine included).
If I had to bet on anyone here, it would be Google11. They have strong vertical integration across their training stack, from chip design up to JAX. That, combined with their lack of reliance on NVIDIA, means they have access to better, cheaper, more optimized compute per flop in addition to the usual chips everyone is competing for. So far we haven't talked much about data, because the consensus is that we have way more data than we can train on12. But Google wins there too — though they've been very careful to avoid training on personal or even public-but-not-open-licensed sources, they very easily could start playing the game the way OpenAI has been when it comes to respecting IP. And even though Google hasn't opened the full data hose yet, one suspects that if they weren't under so much federal scrutiny they might have done so already.
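A small illustration of what "chip design up to JAX" buys (a toy sketch, not Google's training stack): the same jitted JAX code runs unchanged on whatever accelerator backend is attached, whether that's a TPU pod slice, a GPU, or a plain CPU.

```python
import jax
import jax.numpy as jnp

@jax.jit
def step(w, x, y, lr=0.1):
    """One toy gradient step on a linear model; XLA compiles this for
    whichever backend (CPU, GPU, TPU) JAX finds at runtime."""
    def loss(w):
        return jnp.mean((x @ w - y) ** 2)
    grad = jax.grad(loss)(w)
    return w - lr * grad

print("devices:", jax.devices())  # same code, different silicon

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 8))
w = jnp.zeros(8)
y = x @ jnp.ones(8)
w = step(w, x, y)
```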
Still, exponential increases in computation aren't sustainable forever. Even Google, with all its extra compute, is feeling the pinch. My understanding is that they've basically entirely stopped hiring, even for AI research roles. Not because they don't need the talent, but rather because even if they hired that talent, those engineers wouldn't be able to get allocation to run jobs. Every spare machine is going towards feeding the beast.
And it's worth noting that in some macro sense, these scaling laws aren’t sustainable. Exponential curves rarely are. Even if we manage to get to GPT5, GPT6 would require a sizable chunk of all GPUs available today, not to mention incredible energy usage. So the state of AI today is, you have these huge companies, all desperately trying to get the top spot, all stuck sitting on their hands waiting for the foundries to move. What do?
Back to GPT Pro
Well, if you are literally unable to train bigger models to capture a bigger slice of the pie, it might make sense to increase the overall size of the pie while waiting for your chip shipment to arrive.
Anthropic, OpenAI, and Google have all been busy finding new ways to integrate and distribute their models, while supporting ecosystems of startups (both implicitly and explicitly) that depend on their models as core infrastructure. That includes things like building API support, creating new 'products' like OpenAI canvas or Anthropic artifacts, finding ways to improve model performance WITHOUT scaling (e.g., "chain of thought" reasoning) and, yes, developing and bundling premium features like GPT Pro. Would I pay $200 for Chat? No, of course not. But I'm not the target audience, and neither are you. There are definitely some power users who would, especially if the Pro version offers (the appearance of) abilities like not having to hire a lawyer for basic contract work13.
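As a sketch of what "more performance without more scale" can look like in practice, here is roughly what chain-of-thought-style prompting looks like through the openai Python client. The model name is a placeholder and the question is invented; the point is only that the same weights tend to handle multi-step problems better when asked to reason step by step.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = (
    "A contract auto-renews every 12 months with 60 days' notice to cancel. "
    "If it was signed on 2024-03-01, what is the last day to give notice?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": "Think step by step, then state the final answer."},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```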
In any case, none of these are indications that theoretical scaling laws are dead. I'll believe it when someone manages to put together a quarter million GPU cluster and still can't get good performance out of it.
1. Some of what I discuss here is also related to my paper review series — check that out if you're interested in building a more technical intuition!
2. Arguably still is, though it seems they are leveraging Bing and the Bing brand as the entry point for AI integrations in the broader Microsoft ecosystem.
3. Not directly, not in a way they can reason about. There are second-order effects, some of which the DOJ cares a lot about, but these aren't legible to the average person.
4. The way Google itself did the first time around against Yahoo!
5. The cost per token has dropped significantly over the last two years, and several competent LLMs are either totally free — like ChatGPT 3.5 — or open source — like Llama.
6. Arguably also Meta, though their open source play is a bit different.
7. I don't have the exact number. Publicly, it was trained on 8 million web pages, and I assume on average each has 2k tokens.
8. Very likely that we would have to tap into other sources of data beyond pure text. I'm using token as an abstraction unit for training data; you could absolutely "tokenize" images or video.
9. I'm being a bit handwavy here, because we really should be measuring FLOPS instead of GPUs. A high-end GPU today is much more powerful than a high-end GPU from 2019.
10. There's a bit of semantics debate happening here — AGI isn't firmly defined, and I think some people are looking for "consciousness", whatever that means. For me, I go with the definitions of the words. ChatGPT et al are artificial, and they are generally intelligent. By the latter, I mean that you can throw just about any task at them, and they will do reasonably well, probably above average human level. Thus, AGI.
11. Disclaimer: I was at Google for about 3 years.
12. We've more or less run out of internet to crawl, but we haven't come close to working through the backlog of image, video, and audio data.
13. I could definitely see some startup founders doing this. And why shouldn't they? Most of the legal stuff startups need to do is rote, but critical. If an LLM can just do it, you get your money's worth for an entire year easily.