Discussion about this post

Performative Bafflement:

Wow, thanks so much for this breakdown; this is fascinating.

So one implication immediately jumps out at me: there's a fairly hard ceiling on how smart any NN trained on a given dataset can be. If the smoothing function is directly analogous to compression, then compression is a hard problem to make substantial leaps of progress in, and there is a floor below which you cannot compress.
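
To make that "floor" intuition concrete, here's a toy sketch of my own (zlib standing in for the learned compressor, numbers purely illustrative): high-entropy data barely compresses no matter how much effort you spend, while redundant data shrinks a lot and then stops improving.

```python
import os
import zlib

# Toy illustration of a compression floor (my own sketch, not from the post):
# high-entropy data barely compresses at any effort level, while structured
# data shrinks a lot and then stops improving -- that residual is the floor.
random_bytes = os.urandom(1_000_000)                      # ~incompressible
structured_bytes = b"the cat sat on the mat. " * 40_000   # highly redundant

for label, data in [("random", random_bytes), ("structured", structured_bytes)]:
    for level in (1, 6, 9):  # higher zlib level = more "effort"
        ratio = len(zlib.compress(data, level)) / len(data)
        print(f"{label:>10} | level {level} | {ratio:.3f} of original size")
```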

Even distillation is essentially using a laboriously derived compression schema from a bigger / smarter model to show the smaller model the smoothing function's general path and behavior through the dataset's high-dimensional space. But if this is true, even distillation has a hard limit, because compression has hard limits. Yes, o3 can look at the same dataset and come up with a much better smoothing function than o1 or GPT-4 or GPT-3, but there's a point where even a nigh-omniscient "GPT o-omega" wouldn't be able to extract much more from that same dataset.
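
For clarity, by distillation I just mean the standard soft-target recipe (Hinton-style): the student is trained to match the teacher's whole output distribution rather than the raw labels. A minimal sketch, with placeholder shapes and random tensors standing in for real models:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target distillation loss: push the student to match the
    teacher's full output distribution -- i.e. the shape of the bigger
    model's smoothing function, not just the dataset's hard labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 per the usual recipe
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2

# Placeholder tensors standing in for one batch of next-token logits
batch, vocab = 4, 32_000
teacher_logits = torch.randn(batch, vocab)                       # big / smart model
student_logits = torch.randn(batch, vocab, requires_grad=True)   # small model
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```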

So if this is true, it means data walls are really, really important, and we'll S-curve out really soon, because at that point the only way to get smarter is with more data; even god's own general transformer model couldn't find a better smoothing curve if the dataset is small enough. But from everything I've heard, the worries about the data wall have kind of gone away?

I wonder why; I'm really curious about the disconnect here.

We know Gwern pointed out that we can sort of "sample from the future smoothing curve" by spending more inference-time compute on any current model, but this would imply there's a hard S-curve out there where that caps out and avails you no more, because you're at the limits of compression / extractable MDL.

I'm also not sure whether it's a non-issue simply because frontier models are generally trained on only 10-20 trillion tokens, while Common Crawl alone generates around 300 TB a month. Sure, you need to clean and sanitize, and I buy that this drops it by 1-2 OOMs, but that still seems like a lot of headroom? And then there's YouTube and the like and other multimodal data. So even with Chinchilla scaling laws, we probably still have plenty of fresh data as the models get smarter and more capable?
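
Rough arithmetic behind that hunch (the 300 TB/month and 10-20T-token figures are as above; the ~4 bytes per token conversion and everything else is my own guess):

```python
# Back-of-envelope data-wall arithmetic. Only the 300 TB/month and the
# 10-20T-token training-run size come from the discussion; the rest are
# my own rough assumptions.
BYTES_PER_TOKEN = 4              # assumption: ~4 bytes of raw text per token
CRAWL_TB_PER_MONTH = 300         # Common Crawl output, per the comment
TRAIN_TOKENS = 15e12             # midpoint of the 10-20T tokens frontier runs use

raw_tokens_per_month = CRAWL_TB_PER_MONTH * 1e12 / BYTES_PER_TOKEN   # ~7.5e13

for ooms_lost_to_cleaning in (1, 2):
    usable = raw_tokens_per_month / 10**ooms_lost_to_cleaning
    months_per_run = TRAIN_TOKENS / usable
    print(f"cleaning costs {ooms_lost_to_cleaning} OOM -> "
          f"{usable:.1e} usable tokens/month, "
          f"~{months_per_run:.0f} months of crawl per 15T-token run")
# i.e. a fresh training set's worth of text every ~2-20 months, before
# counting YouTube or other multimodal sources -- somewhere between
# "plenty of headroom" and "tighter than it looks".
```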

Or maybe this is entirely the wrong framework for thinking about data walls and the data available: after all, there are so many significant potential overhangs in architecture, methods, and more that could produce a big step change in capability from even the same amount of data (I can think of probably 10) that this isn't even really a map of the territory.

But by bringing this up, you've now made me doubt that those overhangs could really drive a step change, because compression algorithms are generally pretty efficient already, and you wouldn't expect big step changes in compressibility.

Definitely food for thought, and I'm noticing big gaps in my "Frontier AI space" mental model, so I really appreciate that you posted this.
