20 Comments
Ani N:

I think that having strong intuitions about embedding spaces is important, and there are a lot of intuition pumps that can get you there. Worth exploring information theoretic / learning theory ideas.

Side note: I do think that there is an existing problem where people don't spend enough time staring at activations because it's really hard, even if it's really valuable.

theahura:

A post I haven't written yet would try to think about the number of bits that a neural network architecture adds, i.e. how much entropy is reduced by picking a specific neural network architecture. You could model a neural network as a 'random variable' that has the ability to learn any of a wide range of functions, and you can think of the architecture as a way of narrowing down the space of possible functions that could be learned. That means in theory there should be some way to quantify the entropy reduction that a given architecture provides...
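
(To make the idea concrete, here's a toy sketch of my own, nothing rigorous: if you model 'which function gets learned' as a uniform draw over a finite hypothesis space, the bits an architecture 'adds' are just the log of how much it shrinks that space.)

```python
import math

# Toy model: treat "which function gets learned" as a uniform draw over a
# finite hypothesis space. An architecture that can only realize a subset of
# those functions reduces the entropy of the draw by log2 of the shrinkage.
def architecture_bits(n_unconstrained: int, n_reachable: int) -> float:
    return math.log2(n_unconstrained) - math.log2(n_reachable)

# Example: all 2^16 boolean functions of 4 inputs vs. the 1,882 of them that a
# single linear threshold unit can realize.
print(architecture_bits(2 ** 16, 1882))  # ~5.1 bits of entropy reduction
```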

Ani N:

In a very real sense there is! For pretrained models, we embed language into a probability distribution, and thus “compress” it. Perplexity (or, in tokenizer-agnostic form, bits per byte) is a measure of entropy reduction from a simple language encoding (the tokenizer's) to a rich neural one.
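
(A minimal sketch of what that measurement looks like, with made-up log-probs purely for illustration:)

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def bits_per_byte(token_logprobs, num_bytes):
    """Total cross-entropy converted to bits and normalized by byte length,
    which makes the number comparable across different tokenizers."""
    return -sum(token_logprobs) / math.log(2) / num_bytes

# Hypothetical per-token log-probs for a 20-byte string split into 5 tokens:
lps = [-2.1, -0.3, -4.0, -1.2, -0.7]
print(perplexity(lps))         # ~5.3
print(bits_per_byte(lps, 20))  # ~0.6 bits/byte
```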

theahura:

Do you have any links to information about perplexity?

Ani N:

Oops, didn’t see this. The best textbook is Cover and Thomas. A good, mathy blog post:

https://lilianweng.github.io/posts/2017-10-15-word-embedding/

Ani N:

This is distinct from what you are saying about model architectures, to be clear, but I think it's a useful intuition for embeddings.

Gunflint:

I’m disappointed that you didn’t use the old gag that a topologist can’t distinguish between his donut and his coffee cup.

theahura:

Another famous one, though more recent: ML researchers can't distinguish between correlation and causation.

Luke:

A nit of mine: the word "manifold" would be better replaced by "subspace". There's no reason why the set of data has to have the necessary properties to be a manifold (i.e. locally Euclidean and of constant dimension). And machine learning techniques do not require the properties of a manifold either. So I think subspace is the better term to use.

theahura:

I think in practice we actually do observe manifolds, i.e. locally Euclidean behavior within that fixed-dimensional space. A given intermediate layer of a neural network has a fixed dimensionality (e.g. an N-d embedding vector), and embeddings in the local region of a given point behave like Euclidean vectors (thus the word2vec example). Subspace is strictly more general, but I think neural networks try to approximate manifolds in the technical sense. I may be misunderstanding the terms; I don't have a formal background in this area of mathematics, so I would appreciate clarity.
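
(As an illustration of the 'locally linear' intuition -- made-up 3-d vectors standing in for real word2vec embeddings -- analogies as plain vector arithmetic:)

```python
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king - man + woman" treated as ordinary Euclidean arithmetic:
query = emb["king"] - emb["man"] + emb["woman"]
print(max(emb, key=lambda w: cosine(emb[w], query)))  # -> "queen" for these toy vectors
```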

Luke:

I can give some examples where the manifold conditions fail.

First, consider the classification problem where we want to identify images containing nothing but one or two points. The images representing one point occupy a 2-dimensional subspace, given by the x- and y-coordinates of the point. The images representing two points occupy a 4-dimensional subspace, given by the pair of x- and y-coordinates. The union is therefore not a manifold, because the dimensionality is not constant. (There's some technical stuff with how the image is discretized into pixels, but the general idea is correct.) Obviously this is a contrived example, but it's not clear to me that real-world classification problems (e.g. "images of cats") would not also have varying dimensionality.

There are also examples where the embedding in latent space is not a manifold. The simplest example---but very pedantic---is that the output of a single ReLU is not a manifold. That is, y=max(0,x) has an image of [0,inf), and this is not a manifold because it's not locally Euclidean at 0. However, I say this is overly pedantic because this is a manifold-with-corners, so abbreviating to "manifold" is not so bad.

However, one can construct a more complicated example that fails to be any sort of manifold. Suppose our NN inputs are x and y, and they are in the range [0,1]. Suppose the NN outputs are

A = x+y

B = max(0.5, y)

(These equations can be produced with a network of ReLUs.) Then (A,B) defines a set which is not a manifold, because it has a one-dimensional component and a two-dimensional component.
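
(A quick numerical way to see the two components, sketching the example above:)

```python
import numpy as np

# Sample the image of (A, B) = (x + y, max(0.5, y)) over [0, 1]^2. Points with
# y <= 0.5 all collapse onto the line B = 0.5 (a 1-d segment), while points
# with y > 0.5 fill out a 2-d region.
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 1, size=(2, 100_000))
A, B = x + y, np.maximum(0.5, y)

on_segment = B == 0.5
print(on_segment.mean())                         # ~0.5 of the samples hit the 1-d piece
print(A[on_segment].min(), A[on_segment].max())  # the segment spans A in roughly [0, 1.5]
```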

Anyway, I hope this helps understand where I'm coming from. I realize that the term "manifold" is pervasive in ML literature, but for someone with a math background, the term "subspace" seems much more appropriate.

theahura:

Ah I see. Thank you for the examples!

I think the first example still wouldn't be quite right -- yes, it is true what you say about how you *could* model the images, but in practice a neural network is a fixed-dimensionality construct, so it doesn't model them that way. If we had a set of data where some of the data is 1 pixel and some data is 2 pixels, we would traditionally artificially pad out the 1-pixel data to 2 pixels! So the model actually only ever sees a dataset of 2-pixel inputs, and it just learns a special value, which can apply to one of the two pixels, that semantically means 'padded out'. This sort of behavior appears all over.
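
(A minimal sketch of the padding convention I mean -- illustrative, not from any particular codebase:)

```python
import numpy as np

PAD = -1.0  # sentinel meaning "no point here"; the network learns what it means

def pad_points(points, max_points=2):
    """Flatten a variable-length list of (x, y) points into a fixed-size vector."""
    flat = [coord for point in points for coord in point]
    flat += [PAD] * (2 * max_points - len(flat))
    return np.array(flat)

print(pad_points([(0.2, 0.7)]))              # [ 0.2  0.7 -1.  -1. ]
print(pad_points([(0.2, 0.7), (0.9, 0.1)]))  # [ 0.2  0.7  0.9  0.1]
```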

Not sure if I understood the example.

The second set of examples is much clearer. Yes, I think everything you are saying is correct here, but also I don't know that I would call the outputs of the contrived examples 'embeddings' as generally understood in the literature. Maybe the additional point of clarity would be something like "embeddings live in subspaces which behave like manifolds in the areas where the subspace has any semantically coherent values". To be more specific, it really does seem like embeddings behave as linear vectors, i.e. Euclidean geometry math seems to work in places where those embeddings actually correspond to real features (see Toy Models of Superposition, or https://transformer-circuits.pub/2024/july-update/index.html#linear-representations for a shorter treatment).

Ayorai:

I read through all the comments and came to the conclusion that both points of view are valid and complement each other.

For mathematical rigor, "subspace" is the more precise and general term.

To describe the practical behavior and geometric intuition in certain regions of latent spaces, "manifold" (even with its "curves and corners") can be a useful term, though not strictly correct in every case.

What matters is understanding the limitations and nuances of the terms we use, especially in an interdisciplinary field like machine learning, where mathematics and computation meet. Great observations; see you around.

Performative Bafflement:

> Their approach is to create quantifiable heuristics for 'good' reasoning. For example, we can come up with unit tests or math problems for the AI to answer.

One thing I'm curious about here - in past lives, I've done a lot of data science and modeling.

One big thing LLMs should now allow you to do is define "soft targets" on unlabeled data, because now you have a mind in the loop that's basically human-capable in terms of defining things like "positive valence" or "likely to be a net promoter" or "was persuaded" or "has positive opinion of X" and many other things.

If I were still doing any modeling, I'd definitely be experimenting with modeling towards LLM-labeled soft outcomes like this, because it's a unique capability that can probably drive a good amount of lift, depending on your domain. Do you have any knowledge or thoughts on this front? Like this should be a big deal already for companies - back in the data science heyday between 2012 and 2018 or so, we were driving tens of millions of value per year as just one team.
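
(Something like the following is what I have in mind -- a sketch, where `llm_label` is a hypothetical stand-in for an actual LLM call:)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def llm_label(text: str) -> int:
    """Hypothetical stub: in practice this would call an LLM with a prompt like
    'Does this message express positive opinion of X? Answer 0 or 1.'"""
    return int("love" in text.lower() or "great" in text.lower())

unlabeled = [
    "I love the new checkout flow, great work",
    "this product broke after two days",
    "great support team, would recommend",
    "still waiting on my refund",
]
soft_labels = [llm_label(t) for t in unlabeled]

# Train a cheap downstream model on the LLM-generated soft labels.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(unlabeled, soft_labels)
print(clf.predict(["love the support team"]))
```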

I think this is also how you can use inference time to use "current smartest model" explorations of an answer space to create breadcrumbs for the next smartest model, as Gwern has talked about before,* and as you mention in the post.

But to my knowledge, this only works on labeled trues like math or code, and so forth. It sounds like you're saying that when you do that, the model gets smarter enough it's also better at a bunch of soft targets like therapy, persuasiveness, storytelling, etc, just as a happy accidental bonus (because ultimately, everything is connected to everything?).

Can you confirm / deny that interpretation?

What about bread-crumbing directly towards soft targets in the same way? Are we limited because you need a "one-gen-smarter" model or human in the loop to get good "trues?" Because I can see it going both ways there, both decaying into an increasingly noisy and self-referential attractor, or an overall direction of truth that it can perceive, especially if you unleashed it in some outside-labeled or competitive domain (like the paper on the persuade me subreddit, or something like a therapy bot getting confirmations that they've really helped from actual human interactors, and so on).

I'm trying to think about the edges and limits of softer LLM-labeled outcomes, more or less, so if you (or anyone else here) has thoughts on those, I'd appreciate it.

________________________________________

* "Every problem that an o1 solves is now a training data point for an o3 (eg. any o1 session which finally stumbles into the right answer can be refined to drop the dead ends and produce a clean transcript to train a more refined intuition). As Noam Brown likes to point out, the scaling laws imply that if you can search effectively with a NN for even a relatively short time, you can get performance on par with a model hundreds or thousands of times larger; and wouldn't it be nice to be able to train on data generated by an advanced model from the future? Sounds like good training data to have!"

Comment here: https://www.lesswrong.com/posts/HiTjDZyWdLEGCDzqu/?commentId=MPNF8uSsi9mvZLxqz

theahura:

> Can you confirm / deny that interpretation?

Based on the deepseek paper, that seems roughly correct. If reasoning is on a manifold, it's plausible there is a part of the reasoning manifold where you are really good at math but shit at everything else, but it seems like that is a much harder space to get to than just being good at reasoning all around. But, you know, very handwavy, it's not at all obvious why that should be the case a priori.

Re: softer targets, I'm not sure. I know there are a few startups doing this, which means there are definitely bigger companies doing this, but idk if anyone has cracked it. I think part of the reason this is hard is because, well, LLMs aren't actually all that good at this in a consistent way *at the boundaries*, which is where it would matter the most. Especially in an RL regime, where bad rewards can be really destabilizing.

This is actually something I've been chewing on for a while -- there is an asymptote that exists at the boundary of human cognition, and it is not at all obvious to me how we get over that hump, because we rapidly lose the ability to *evaluate* the models. Like, the deepseek paper had one model that was purely RL trained, and it came up with a weird mixed language to do its reasoning in. Which, ofc, makes sense -- why would an ASI think in *English*? But then how do we evaluate whether such a model is worth anything at all? More generally, you can continually improve your reasoning traces as long as you can evaluate which of two reasoning traces is better. But once you can't distinguish them any longer, what happens? It's a chain of thought that has given me pause about the predictions of FOOM.

RL is really hard to optimize in general. Like, you can get everything right -- the data is correct, the architecture is good, the environment settings are right -- and the model will still just totally fail half the time. You need to be able to train a lot of these things, you need some kind of self learning to generate a lot of reward gradients, etc. etc. I think without that last part it becomes kinda tricky.

One way forward that I think is plausibly interesting is some research my friend published here: https://huggingface.co/llm-council

You could imagine that instead of having a single LLM in the loop, you have a 'council' of many LLMs and several humans that all vote on the best outcomes, and over time it gets better and the humans sorta phase out. IDK if that would *actually* work, but maybe wisdom of the crowds is sufficient?
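
(A rough sketch of what that voting loop could look like -- the judges here are hypothetical stand-ins for LLM or human evaluators:)

```python
from collections import Counter

def council_vote(candidates, judges):
    """Each judge (an LLM call or a human) returns the index of its preferred
    candidate; the council's signal is the majority choice."""
    votes = [judge(candidates) for judge in judges]
    return Counter(votes).most_common(1)[0][0]

# Stand-in judges for illustration; in practice these would be model/API calls.
prefer_longer = lambda cands: max(range(len(cands)), key=lambda i: len(cands[i]))
always_first = lambda cands: 0
judges = [prefer_longer, always_first, prefer_longer]

print(council_vote(["short answer", "a much more detailed answer"], judges))  # 1
```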

Performative Bafflement:

Really love the "wisdom of crowds" idea, I think that's a great method for soft targets, and probably much else.

And yeah, I know RL is really finicky and high touch. I'm actually still not sure why we can't do Gwern "breadcrumb" style reasoning traces coupled with gradient descent, because I thought the big gap between RL and gradient descent was that there's no verifiable targets in the middle between start and outcome, but RL gets you those thinking traces in the middle. But if we're generating the probabilistic traces anyways, why aren't you just gradient descenting those then, instead of dealing with finicky RL methods?

Or, if you're afraid of local optima, do something like MCMC around the reasoning traces to explore the space further around each intermediate step, with an evaluation of each chain of reasoning produced using whatever "credit assignment" methods they're using already?

It's probably in the weeds, and I'm sure the break is probably strongly RL favorable given it's what everyone does, but still seems like there's some lift in going simpler / less finicky where you can.

> More generally, you can continually improve your reasoning traces as long as you can evaluate which of two reasoning traces is better. But once you can't distinguish them any longer, what happens? It's a chain of thought that has given me pause about the predictions of FOOM.

Here my mind immediately goes to something like a model of holdouts, predictions, and x-fold cross validation on the data sets available on those various soft targets.

Ultimately you get your network of meaning from your language datasets, but there are further layers of information there too, beyond the contextual and inferential vectors and embeddings around the words and sentences and paragraphs themselves: the timewise unfolding of call and response in ongoing conversations, and the prospective soft-target outcomes contained in that time series.

You wrote that post on the paper where you get timewise pieces and understandings for free from attention mechanisms, akin to LSTM in older methods, but more powerful.

But I think if you set up the right architecture, you could also tap into timewise changes in persuasion, emotional valence, etc. The soft targets under discussion. And the attention mechanism is already so well suited to this, all you really need is the architecture of "predict, validate against holdouts, x-fold validation," which seems like a really easy lift.
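
(For concreteness, the "predict, validate against holdouts, x-fold validation" loop on a soft target is basically this -- with random stand-in features and labels:)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data: X would be features of the interaction (embeddings, say) and
# y the LLM-assigned soft label (persuaded / not, positive valence / not, ...).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())  # out-of-fold predictability of the soft target
```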

The labeling is really the sticking point there, I suppose, but again, there are options - already o3 and 2.5 Pro are smart enough that they're probably pretty formidable with any soft target deployment, so all you'd need to do is generate the labels with the one-gen-smarter internal models. Or maybe you could "wisdom of crowds" it and take some consensus-weighted vote, as you mention.

That doesn't speak unambiguously to iterated self improvement (I could see a truly smart model iterating beyond even a crowd of dumber models pretty easily - imagine von Neumann or Grothendieck against a crowd of average R1 mathematicians), but of course, the ultimate arbiter of truth should be reality itself. Instead of labeled holdouts, you can literally just deploy a fine tuned model in the world and see how it does on the soft targets, with self-evaluation of "predicted outcome vs real outcome" error. Just like humans learn.

Sure, the sample size you need for transformer learning is ungodly huge compared to the samples humans need, but these things operate at scale, and OpenAI has many hundreds of millions of DAU. And ideally, even this requirement will diminish with better architecture or clever scaffolding (and if so, this is a pretty high leverage overhang which could see capabilities really jump up suddenly).

I suppose "live learning" is the final frontier, which I'm sure the frontier labs are all thinking about.

theahura:

Wait maybe I'm misunderstanding something

> I thought the big gap between RL and gradient descent was that there's no verifiable targets in the middle between start and outcome, but RL gets you those thinking traces in the middle. But if we're generating the probabilistic traces anyways, why aren't you just gradient descenting those then, instead of dealing with finicky RL methods?

I think people are doing this! That's the 'sample a bunch from smarter models and fine tune (using gradient descent) on those', right? Or am I misunderstanding something?

If your question is just 'why bother with RL at all', I think it's because it's hard to generate good reasoning traces. The Deepseek approach isn't *really* an RL paper; it's not like the RL itself is novel. It's really novel for its mechanism of generating new data. And I think implicitly we're supposed to understand that the reasoning traces produced this way are *better* than those produced by RLHF / CoT prompt engineering (once the RL model reaches convergence, anyway). Gwern's breadcrumb approach may get us to the same point eventually, but I imagine it would take quite a bit longer and may not actually reach the same point.

> of course, the ultimate arbiter of truth should be reality itself. Instead of labeled holdouts, you can literally just deploy a fine tuned model in the world and see how it does on the soft targets, with self-evaluation of "predicted outcome vs real outcome" error. Just like humans learn.

Yea I think this is where things eventually go, but it turns out this is also hard to evaluate. I'd be curious to see reasoning trace RL applied more directly to games before jumping straight to reality -- there are clear rules there that result in eventual reward, and if it's like what we're observing with math/code problems, improvement in games should result in improvement elsewhere.

> Sure, the sample size you need for transformer learning is ungodly huge compared to the samples humans need

Unclear! Say the human eye processes something like 30fps. That means by the time a human is 3 years old, it will have ingested / 'trained on' ~3 billion images. And that's just images! The human is also getting audio, touch, smell, taste, etc. etc. I think it's kinda fascinating that transformers understand so much of our world given just a single input sense (text).
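
(The back-of-envelope, for the curious -- the 16 waking hours/day is my own assumption:)

```python
frames_per_second = 30
waking_hours_per_day = 16  # assumption
frames = frames_per_second * 3600 * waking_hours_per_day * 365 * 3
print(f"{frames:,}")  # ~1.9 billion; assuming 24 h/day gets you to ~2.8 billion
```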

Performative Bafflement:

> Wait maybe I'm misunderstanding something

Yeah, this is probably my fault for not being clear about which specific use case and method I'm referring to each time, sorry about that.

My broad understanding (please correct anything that's wrong):

Traditional reasoning models - use RL against known true targets like math and code and textbooks and exams, and it builds a better reasoning capability that then generalizes to being smart about softer things too.

Deepseek approach / distillation - elicit reasoning traces and answers (and possibly weighted probabilities around traces and answers in the best case, which is how o3-mini's and the like are created) from a smarter model, and use that to distill the search space down massively to get a big bump in performance from a smaller model, by pre-carving the search space to the already known / smart spaces.

Gwern breadcrumb approach - burn inference time on a given model to calculate better and better traces and reasoning (see the many inference time vs quality / smarts graphs), and then trim the cruft and dead ends, using the highest quality and most direct traces and answers as pre-seeds for the next smarter model's RL paths.

> I think people are doing this! That's the 'sample a bunch from smarter models and fine tune (using gradient descent) on those', right? Or am I misunderstanding something?

So this sounds to me like the Deepseek / distillation approach, is that accurate?

I didn't actually realize they were doing gradient descent there, I actually thought it was RL, with the traces as the seed. But if they are doing gradient descent, I should definitely read it closely.

My rough mental model of how Deepseek's distillation worked was something like using the reasoning traces and answer as the seed, then using RL to explore the probability space around the traces, because they couldn't get the probabilities directly from the o1 API call, like oAI can when creating o3-mini and similar models. Interested to know if / where that mental model is off.

Copy that on the games - although I guess I've never understood why self-play works for chess or go, but self-play doesn't work as well on platformers or other "real" games. It can't be a combinatorial space thing, the combinatorial space on chess or go is already absurd, and definitely not fully sampled when it's solving games like those.

> Unclear! Say the human eye processes something like 30fps. That means by the time a human is 3 years old, it will have ingested / 'trained on' ~3 billion images.

Oh, now we're getting into the fun stuff!

So famously, we only get 10b/s of conscious processing (per Zheng and Meister’s The Unbearable Slowness of Being - https://arxiv.org/pdf/2408.10234)

I once back-of-enveloped human and octopus unconscious calculation processing power for a post I did that looked at octopus texture-and-color camouflaging, and got roughly 10-100mb/s for humans and 3-20mb/s for octopuses - I even triangulated the 100mb/s ceiling for humans by looking at A100 calculation capacity and comparing the wattage for the GPU and for humans!

That stuff was here about 2/3 down, if you're interested: https://performativebafflement.substack.com/p/oceanic-alien-minds

But I think we have to ask ourselves, how many images are actually being processed? The eye in theory sees maybe 10mb/s of info - but is all that REALLY being processed, even by the unconscious? Especially by babies?

Because we can step it down really easily to much simpler and smaller brained organisms with similar eye capacities, and there's no way they could be, right?

Like it's mindblowing the behavioral suites that lizards or small mammals have. And we know a lot of that is effortfully carved over millions of years and compressed (somehow) and pre-seeded (somehow) into the fundamental neural architecture and instincts of those animals.

It's almost directly analogous to the breadcrumbing - you're using inference time over millions of years / organisms to carve a working solution through the combinatorial search space, and compressing and seeding that path as you go.

Humans (and probably other animals) have something directly analogous at a different time scale, too - I write often about coaching and elite performance on my substack, and the big differentiator for elites is that they have more comprehensive and better unconscious schemas in their minds that they've effortfully burned in, and this allows them to use their meager 10b/s of conscious attention and processing better. So we see the same dynamic reflected again, on a shorter time scale.

I think there's something like a training-data-needed to amount-pre-seeded efficient frontier, where the more smarts you have preseeded, the fewer training examples you need. Dogs learn tricks after lots of repetition and positive reinforcement, generally 50s to hundreds. Humans can learn after only a few. Humans have a lot more unconscious schemas written into their minds, and I imagine this is also the difference in smarts between an o3 vs an o1 or 4o, for example.

The difference in smarts between humans seems very much a "better unconscious schemas and so better ability to predict things" also, and smart people seem to spend a lot more time making predictions and thinking about why they were right or wrong.

Maybe this is the big difference - today we write a schema difference with training time and reasoning traces, but is there actually prediction at the relevant levels? And I know, "PB, you fool, prediction and refining a probability map is ALL that they're doing!" And sure, the attention heads let you map and predict the next word, and then even larger scales if you want, like sentences, paragraphs, reasoning traces, etc. My broad understanding is you create an ever better and more refined probability map of linguistic meaning and correlates, and this maps to a similar rough probability map of reality.

But they're not actually trying to predict non-word features of reality, right?

You might eventually infer "if I drop a glass, it breaks" from the frequency the concept occurs, but you never TRIED to predict what would happen to the conjoined concept of "glass" + "dropped" in a sort of holdout before running across the concept 100k times.

But just look at any 2 year old - those little miscreants are *definitely* trying to predict what would happen if you drop a glass, or put it in your mouth, or bang it on the table, etc.

I guess I'm just back at "holding out and cross validating," but at larger scales of concepts. But is this actually close to what's going on when trying to predict better reasoning traces? This is a gap in my understanding: at which conceptual and inferential level does that occur?

Sorry, I realize this got long, don't want to be disrespectful of your time, just got a little carried away.

I so greatly enjoy these exchanges, because I always surface such interesting gaps in my understanding!

theahura:

> My broad understanding (please correct anything that's wrong)

I think you've got some terminology wrong. This is the progression *as I understand it*. I label the time periods as the range where this was seen as a way to get more juice out of the model

- next token prediction using autoregressive loss (2019-2023)

- instruction tuning / RLHF (2022 - 2023)

- chain of thought / "show your work" prompt engineering (2023 - )

- sampling better chain of thought reasoning using humans and then finetune using backprop (this is the bread crumb approach) (2024 - )

- sampling better chain of thought reasoning using RL on true targets and then finetune using backprop (you called this 'traditional reasoning models' but I think this is what *deepseek* is doing) (2024 - )

The distillation thing is done throughout. Distillation is a pretty well-known thing that is separate from RL. Though I can see how you got that mixed up: distillation kind of looks like an RL approach where the 'teacher model' is the reward function. The key difference is that distillation is done entirely w/ backprop.
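
(To make the contrast concrete, here's a minimal sketch of a distillation loss -- standard KL-to-the-teacher, not anything specific to deepseek:)

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Pull the student's token distribution toward the teacher's: pure backprop,
    no reward signal anywhere."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)

# Toy shapes: 4 positions, vocab of 10.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
distillation_loss(student, teacher).backward()  # gradients flow to the student
```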

The interesting thing about deepseek is that they got so far without any supervised finetuning. But the model that resulted was...weird. It would mix languages together and not really listen to instructions properly and a few other things. So they sampled the RL model and curated *that* for reasoning traces, and then did finetuning. That's my understanding, at least.

Here is the relevant text from the deepseek paper:

> When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round...We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training...Additionally, because the model output is sometimes chaotic and difficult to read, we have filtered out chain-of-thought with mixed languages, long paragraphs, and code blocks. For each prompt, we sample multiple responses and retain only the correct ones. In total, we collect about 600k reasoning related training samples
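
(My reading of that passage as a pipeline sketch -- `generate` and `is_correct` are hypothetical helpers standing in for the RL checkpoint and the answer checker:)

```python
def collect_sft_data(prompts, generate, is_correct, samples_per_prompt=8):
    """Rejection sampling as described above: sample several completions per
    prompt from the converged RL checkpoint, keep only the correct ones, and
    drop hard-to-read outputs. The result is then used for plain supervised
    fine-tuning (next-token loss), not more RL."""
    def readable(text):
        # Crude stand-in; the paper also filters mixed languages and code blocks.
        return len(text) < 8_000
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)  # hypothetical call to the RL checkpoint
            if is_correct(prompt, completion) and readable(completion):
                dataset.append({"prompt": prompt, "completion": completion})
    return dataset
```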

I think deepseek is important because it provides an automatic way of generating good reasoning traces en masse, which i think did not really exist before. What you call the 'traditional' approach was actually very new!

> The eye in theory sees maybe 10mb/s of info - but is all that REALLY being processed, even by the unconscious? ... Like it's mindblowing the behavioral suites that lizards or small mammals have. And we know a lot of that is effortfully carved over millions of years and compressed (somehow) and pre-seeded (somehow) into the fundamental neural architecture and instincts of those animals.

A while back I was doing connectomics research on fruit fly retinas. I think it's really hard to say that the data is *not* being used. In part because I think if it wasn't being used, surely evolution would have gotten rid of the extra capacity? But also because there is a whole lot of stuff that looks a lot like a high dimensional embedding space in biological circuitry, which the rest of the organism's brain is interacting with seemingly "deterministically" (I say that with quotes because even for really small models, like C. elegans, we don't have complete understanding of what inputs lead to which outputs). 10mb/s is a *lot*, but also maybe some of the video understanding models are taking in data at that scale? Like, back of the envelope, convert a video into just images and run a big convolutional network on it, and it's maybe ingesting 10mb/s? (The per-second part makes the comparison not exactly perfect, but you get the idea.)
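
(Back of the envelope, assuming raw RGB frames at a typical vision-model resolution -- the 224x224 is my assumption:)

```python
bytes_per_frame = 224 * 224 * 3    # raw RGB at 224x224
print(bytes_per_frame * 30 / 1e6)  # ~4.5 MB/s at 30 fps; ~10 MB/s at 336x336
```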

> I imagine this is also the difference in smarts between an o3 vs an o1 or 4o, for example.

https://blog.ai-futures.org/p/making-sense-of-openais-models

This post goes into detail about the different GPT models -- it seems like the various o* models are finetuned versions of different base models. So yea, the schema thing isn't a bad analogy here, where the 'schema' gets baked in by training. In some sense it's all compression, and presumably your compression gets better the more data you see (though there is an asymptotic limit somewhere).

> But just look at any 2 year old - those little miscreants are *definitely* trying to predict what would happen if you drop a glass, or put it in your mouth, or bang it on the table, etc.

> I guess I'm just back at "holding out and cross validating" but at larger scales of concepts. But is this actually close to what's going on when trying to predict better reasoning traces? This is a gap in my understanding, understanding at which conceptual and inferential level that occurs at.

One thing that is not really at all clear is the way in which 'memories' interact with 'process', both in biological circuits and in neural networks. Like, the 2yo in your example is basically running an experiment, and then it files the results away. Does it change the underlying 'process' so that next time the 2yo does the same thing some neurons fire giving it a prediction of what to expect? Or does it change some underlying memory storage, such that the next time the 2yo is in a similar position it retrieves that memory? I think the answer is both, which is something I've discussed in my paper review series.

Currently though, predicting better reasoning traces is really 'just' finetuning next token prediction, iiuc. Which to your point is obviously limited. Part of why deepseek was interesting was because it was a method of improving reasoning traces WITHOUT next token prediction. The S1 paper is also interesting for this reason. More generally, we need ways to move across the reasoning manifold that don't depend on just predicting words.

> Sorry, I realize this got long, don't want to be disrespectful of your time, just got a little carried away.

> I so greatly enjoy these exchanges, because I always surface such interesting gaps in my understanding!

Not at all! The feeling is mutual

Performative Bafflement:

> But the model that resulted was...weird. It would mix languages together and not really listen to instructions properly and a few other things. So they sampled the RL model and curated *that* for reasoning traces, and then did finetuning. That's my understanding, at least.

On the mixed languages and code blocks thing, it reminds me of a concept in Nancy Kress' Beggars in Spain books, where they gengineer sleeplessness and then the sleepless gengineer the next gen as sleepless + superintelligent. The superintelligents have network-based thought patterns that basically correspond to different levels of metaphor, analogy, and abstraction, and so they'll talk about things being on the first level or third level and so on, and create software that represents these association clouds and analogy layers.

This also connects with Hofstadter's latest book (Surfaces and Essences), such that everything is layers of correlations at different symbolic and analogous levels. This seems connected to my question about trying to hold out and predict various layers beyond just "next word." What happens in humans (and probably lower organisms) is prediction at various layers of analogical connection across concepts, and over time, error minimization per prediction.

So I think stepping up from simple next word prediction and trying to assemble some sort of connected layering of "next level" multi-word concepts that you could run holdout and predictions on would be a fun avenue to pursue. Basically trying to really juice zero or few shot generalization, in a way that's distinct from instruction and COT. You can even score those connections with existing next word prediction schemas, so it's scalable!

> Here is the relevant text from the deepseek paper:

> When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round...We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training...Additionally, because the model output is sometimes chaotic and difficult to read, we have filtered out chain-of-thought with mixed languages, long paragraphs, and code blocks. For each prompt, we sample multiple responses and retain only the correct ones. In total, we collect about 600k reasoning related training samples

Thanks - do you mind if I restate what I think is happening, and get your feedback on anything wrong in my understanding?

I'm imagining something like RL-ing towards various known answers (supplied via objective endpoints like math / code, or by "treated as objective" endpoints like Deepseek getting o1 answers from the openAI api), and just running a bunch of RL chains to those endpoints, then evaluating the generated reasoning traces (via rejection sampling, but I'm not sure quite how: how do you get the "true" flag that your reasoning trace is a proposed step along the way to? Are you just seeing if this trace shows up in multiple RL chains and then flagging it true above some threshold?) to measure "convergence." And this is "when reasoning-oriented RL converges." Then they scrub those of stuff they don't like / understand, and the ones left are reasoning training samples for fine tuning.

Alternatively, I guess "convergence" could be multiple RL chains reaching a given point that's basically the same, with some high dimensional distance measure of those similar steps in the reasoning chain like Mahalanobis or Metropolis distance over the overall wordvec, and then just assembling chains of reasoning that include multiply-arrived points and the right destination, and that's "retaining only the correct ones" while sampling multiple responses. Then scrub and use for fine tuning.

> A while back I was doing connectomics research on fruitfly retinas. I think it's really hard to say that the data is *not* being used. In part because I think if it wasn't being used, surely evolution would have gotten rid of the extra capacity?

Wow, really? I think we can definitely say it's not being used.

Like can't we bound this just via "computational capacity" and "heat generated" or something? Processing 10mb/s in an ongoing way is a LOT for a drosophila-sized brain (or reptile / small mammal / etc). To your point that evolution would have gotten rid of the excess capacity, I think it was just never there? Eyes may be "capable" of seeing 10mb/s, but the only thing that generates behavior is eye+brain+body systems, and any bottleneck in any of those just never takes in or processes that "extra" data at all, right?

I mean, you can stare at a random-looking bitmap that encodes 10mb of the digits of pi, but that doesn't mean your brain is processing or perceiving pi to that depth, it's just noise, it's not being processed at all.

I think I get your point about compression and the high dimensional embedding space that evolution wreaks over millions of organisms and time steps, but there's still a fundamental difference between having information exposed to the optic nerve and actually processing or perceiving it in a way that informs actions or fitness. I'm on team "it's mostly not used," for basically every organism. In other words, I think most organisms (including humans) have vastly more information presented to the optic nerve than they ever use, even unconsciously.

Like scale this up - reality is basically fractally complex, right? What was Eliezer's old analogy? That a sufficiently high level mind looking with sufficiently high resolution at a drop of water falling would basically be able to derive Newtonian mechanics in its entirety?

That level of information is entering EVERY organism's eyes all the time, but we've only invented Newtonian mechanics the once after perceiving untold bignums of samples, and every other species hasn't even gotten that far.
