Google's ability to control its own destiny means it can pursue technical strategies that are simply out of scope for its competitors. This is immediately obvious if you look at the model specs for Gemini against every other flagship model.
Just wanted to say thanks for writing this - Google recently opened 2.5 to the web interface, so now you don't need API calls, and I've been using it thanks to this post.
They *cooked.* It is significantly smarter, more capable, and less error-prone than o1 Pro or Claude 3.7, in my own opinion. It goes deeper in detailed ways, and I haven't run across Gell-Mann Amnesia once in any subject I know deeply. It's been able to go deeper than my knowledge in those areas too, which is a first - and when I double-checked, it was right and wasn't hallucinating.
Another advantage - I like to "adversarially collaborate" and test my ideas and arguments, and o1 Pro and Claude 3.7 and all the other models really suck for this - they immediately roll over at the tiniest pushback.
But 2.5 doesn't do this - it stakes out a consistent position and maintains it over time, while staying amenable to factual corrections or rebuttals (but not vibes-based ones!). It's so much smarter than every other model right now that I've made it my daily driver.
And all thanks to this post! I don't think I would have tried it if I hadn't seen your post and Zvi's talking about it. Anyone else reading this, if you haven't tried it, it's available at the Gemini web interface for free - you might be pleasantly surprised, like I was.
I'm glad! I will say that subjectively I feel like it got a lot dumber in the last 48 hours, and I'm not sure why. I was using it for code gen through the API, so it's plausibly a different experience than through the studio 🤔
This is interesting, thanks for the post!
> But I am pretty certain OpenAI pioneered this direction precisely because they were feeling the pinch of their compute limitations.
This is interesting, though it seems like OpenAI and Anthropic are still investing in larger model runs (GPT-4.5 and Claude 3.5 Opus), and it seems like pre-training returns are just diminishing (but still there). If the claim is "Google can scale pre-training more because they have the most compute power", that feels dependent on scaling pre-training still giving good returns? And sure, 2.5 Pro cooked, but it's hard to tell how much of that is because of test-time compute and how much is from scaling pre-training.
> And second, and maybe most importantly, because those same employees are now stuck at the company until a liquidity event (an IPO or an exit or some round of financing) which significantly limits optionality.
Why are they stuck at the company exactly? Because they'd have to exercise & pay taxes on gains if they leave?
> "Google can scale pre-training more because they have the most compute power", that feels dependent on scaling pre-training still giving good returns
This is absolutely true, but also we always knew that there would be diminishing returns on scale. The scaling laws are all power laws, which means you need exponential increases in resources for linear improvements in ability. For a long time, none of the big players actually had the capacity to 10x their previous training runs. GPT-4 was trained on 20k GPUs; there were NO datacenters that had 200k GPUs until very recently. So I'm curious to see what the next generation of models looks like.
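To make the power-law point concrete, here's a toy sketch (the constant and exponent are invented for illustration, not fit from any actual scaling-law paper):

```python
# Toy power-law loss curve: each 10x in compute multiplies the loss by the
# same factor, so the absolute gain shrinks with every generation.
def loss(compute, const=10.0, alpha=0.05):
    return const * compute ** (-alpha)

for c in [1e22, 1e23, 1e24, 1e25]:
    print(f"compute {c:.0e} -> loss {loss(c):.3f}")
# compute 1e+22 -> loss 0.794
# compute 1e+23 -> loss 0.708
# compute 1e+24 -> loss 0.631
# compute 1e+25 -> loss 0.562
```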
But there are other ways to improve models that aren't just related to pretraining -- context window size, effective context window, inference speed, these are all important as well. Part of the reason I harp on context window is precisely because it is a form of scale that is generally not thought of in our usual pretraining metrics, and yet obviously has meaningful impact in the kinds of results we get out of these models.
> Why are they stuck at the company exactly?
Yea, some form of golden handcuffs. Anecdotally, friends at OAI say that they have a bunch of options and get some liquidity at tender events, but OAI chooses who gets to sell and how much they get to sell, often prioritizing the needs of current employees over previous ones. So even if I pass my cliff and get 25% of my option grants, I can't actually sell those options until a tender event occurs.
Anthropic may be even more restrictive, since I don't think they do regular tenders.
With Google, I can get the stock and sell it immediately.
> More generally, I am skeptical of Grok, not least because of the tendency of xAI's owner to exaggerate.
I’m surprised that as a person who writes blog posts about AI, you didn’t try it. I use it as my daily driver for question answering, alongside Claude 3.5 for more focused generative tasks
Afaik it hasn't been tested on independent benchmarks, though it supposedly performed well on Chatbot Arena. But there are too many models to try all of them, I personally didn't know anyone using Grok (and I know a lot of people in AI), and I have personal reasons for refusing to use Grok that you can get the general tenor of if you read anything in this blog tagged #politics
Great post - one thing you didn't specifically go into is how crazy efficient TSPs and TPUs are relative to GPUs - it's something like a 2-8x buff on "inference per watt," which is a huge deal.
I think you did a great job lining up Google's advantages, which are many. I'm still conceptually-but-not-literally short on Google overall in the AI game, though - yes, they have a ton of advantages, but they have proved repeatedly that they are capable of snatching defeat from the jaws of victory due to cultural and execution problems.
I can't think of a bigger example of mismatch between "human capital and talent in employees" versus "quality of products," for either consumers or customers (advertisers).
Yea that's definitely true. Even now, just getting to the better Gemini models requires going through GCP, which may frustrate and scare off a lot of users. Anecdotally, GCP shut down our API access for a day for supposed ToS violations, which on review was rescinded. So it's just a lot more friction.
So I'm going to replace 'RLHF' with 'post training' in my response, because RLHF is one method among many for doing post training.
First, just to quickly explain what post training even means. Large language model training is split into two pieces: pre training and post training. Pre training is 'predict the next token based on the previous tokens'. So if I have the sentence "the quick brown fox", the model should output "jumped". Pretraining is done over vast corpuses of data.
Once you are done with pretraining, you will have a model that has learned 'language statistics'. But most of these models do not really do what the user wants. For example, if I give a pretrained model the sentence "Ten questions on US history", the model will NOT give me ten questions on US history. Rather, it will list other such statements, like "Ten questions on geography
Ten questions on literature
Ten questions on science"
The reason it does this is again because it has only learned to output the next-most-likely-word.
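If you want to see this for yourself, here's a minimal sketch using a small open base model (gpt2 here purely as a stand-in for "any pretrained-only model"):

```python
# Minimal sketch: a pretrained-only (base) model just continues the text with
# likely next tokens; it does not follow the implied instruction.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any base model with no post training
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Ten questions on US history", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0]))
# Typical continuation: more list-like headers ("Ten questions on geography...")
# rather than actual history questions.
```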
The goal of post training is to get a model that interacts with the user in some desired way. Generally, this includes RLHF, but it can also include other kinds of fine tuning (e.g. for style, tool use, or instruction following) and model optimization (e.g. distillation, quantization). Though RLHF does traditionally use RL, it's more useful to think of it as a replacement for _collecting_ a bunch of data with which you would ideally do fine tuning.
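As a rough illustration of what I mean by "a replacement for collecting data" (field names invented, not any lab's actual schema): supervised fine tuning needs hand-written ideal responses, while RLHF collects cheaper preference comparisons and trains a reward model to generalize them.

```python
# Instruction tuning (SFT) needs an ideal response written out per prompt.
sft_example = {
    "prompt": "Ten questions on US history",
    "response": "1. Who was the first US president? 2. ...",
}

# RLHF instead collects comparisons between model samples; a reward model is
# trained on many of these, and RL then pushes the policy toward outputs the
# reward model scores highly.
rlhf_preference_example = {
    "prompt": "Ten questions on US history",
    "chosen": "1. Who was the first US president? ...",   # rater preferred
    "rejected": "Ten questions on geography\nTen questions on literature",
}
```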
---
Ok, with that all out of the way:
I think Goog retains the advantages when it comes to post training generally and RLHF specifically. Most of this is just from first principles -- if post training benefits from really high quality data sampling, Google has the most resources to throw around to get the best data samples. But all of the post training stuff falls into a similar bucket as the inference-time-compute stuff pioneered by OpenAI, in that it's fairly easy to copy. All of the big models do some kind of RLHF (though I think that's actually fallen out of popularity a bit in favor of highly selected-for reasoning trace samples, but don't quote me on that).
I'm not sure what you mean by your second question. Taking it as written, I think you have it backwards a bit. The architecture comes first, and the RLHF (and post training in general) second. But they really are sort of independent things; it's not clear to me that they have any real relationship. I don't think people are constructing models that are optimized for particular post training regimes, or vice versa. Could be wrong about this.
Re scaling laws, it depends. The scaling law papers that I'm familiar with mostly evaluate pretrained models, not post trained ones. There is a body of research that shows that you can go past the results predicted by existing scaling laws with small increases in compute or by using transfer learning. I'm not sure if there is anyone who has straight up calculated the scaling curves and exponents for post trained models. That said, my guess is that the scaling laws would still mostly hold in shape (i.e. some kind of power law scaling curve); what would change is the exponent values.
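To spell out what I mean by "same shape, different exponents" (my notation, not pulled from any specific paper):

$$L(C) \approx a \cdot C^{-\alpha}$$

i.e. post training would plausibly shift the fitted constants $a$ and $\alpha$, but not the power-law form itself.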