Ilya's 30 Papers to Carmack: NN Regularization
This post is part of a series of paper reviews, covering the ~30 papers Ilya Sutskever sent to John Carmack to learn about AI. To see the rest of the reviews, go here.
Paper 6: Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
https://www.cs.toronto.edu/~fritz/absps/colt93.pdf
High Level Summary
First off, this is an old paper for the field — Hinton, 1993. That’s not quite at the founding of deep learning, but the original backprop/MLP paper was only in ‘86. We’ve come a long way since then.
The paper proposes a way of regularizing model weights using Gaussian mixtures. A Gaussian model is a curve-fitting procedure where you adjust two parameters (mean and variance) to fit a Gaussian to some data, similar in spirit to linear regression, just with a Gaussian instead of a line. A Gaussian mixture combines several such Gaussians, each with its own mean, variance, and mixing proportion.
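To make that concrete, here's a minimal numpy sketch (mine, not the paper's) of fitting a single Gaussian to data by maximum likelihood. A mixture does the same thing with several Gaussians at once, plus mixing proportions.

```python
import numpy as np

# Toy data we want to summarize with one Gaussian.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)

# The maximum-likelihood Gaussian fit is just the sample mean and variance.
mean = data.mean()
variance = data.var()
print(f"fitted mean={mean:.3f}, variance={variance:.3f}")
```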
Like all regularization, the goal of this Gaussian mixture method is to avoid model overfitting. Overfitting occurs when the model gets really good at predicting the training data and sucks at predicting the test data. More concretely, the model homes in on noise: stuff that is present in the training data but isn’t actually relevant for the overall task at hand.
One way to prevent overfitting is to limit how precisely the model can fit the training data, by adding noise to each weight and deliberately making the model a little worse at its predictions. That noise can come from any probability distribution, including a Gaussian mixture.
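In its crudest form (this is my sketch, not the paper's scheme, which adapts the noise per weight and charges for it in the description length), weight noising looks like this, with a hand-picked noise level:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 32))   # stand-in for one layer's weights

# Perturb every weight with a draw from some noise distribution, here a
# zero-mean Gaussian with a made-up standard deviation.
noise_std = 0.05
noisy_weights = weights + rng.normal(scale=noise_std, size=weights.shape)
```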
This paper builds on the Minimum Description Length theory (Sections 1 - 4), which roughly states that the “optimal” model to represent a set of data is “the one that minimizes the combined cost of describing the model and describing the mis-fit between the model and the data” — that is to say, it minimizes training loss AND model size. Why might the MDL principle hold? Imagine we only cared about getting a really low loss on our training data — for any model that generates some loss, we could always make a bigger or more complex model which would generate a smaller loss. But of course, doing so makes the model worse at predicting new data (because of overfitting, above). “So we need some way of deciding when extra complexity in the model is not worth the improvement in the data-fit.” Thus, MDL.
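In code, the MDL trade-off is just a two-term loss. A rough sketch of how I read it (my framing and coefficient, not the paper's notation), where the complexity term is whatever "bits to describe the model" measure you choose:

```python
import torch

def mdl_loss(preds, targets, weights, complexity_cost):
    # Cost of describing the mis-fit between model and data: squared error,
    # which (up to constants) is the code length of the residuals under a
    # Gaussian error model.
    data_cost = 0.5 * ((preds - targets) ** 2).sum()
    # Cost of describing the model itself. The paper uses a Gaussian-mixture
    # coding cost here; plain L2 weight decay is the simplest stand-in.
    model_cost = complexity_cost(weights)
    return data_cost + model_cost

# Example complexity term: L2 weight decay (coefficient is made up).
def l2_cost(w):
    return 0.01 * (w ** 2).sum()
```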
The paper then dives into a new way of calculating the model description, using a set of Gaussian curves (Sections 5 - 7). This is the bulk of the paper and, I’ll be honest, the part I understood the least. I think Section 5 is primarily about explaining how to apply Gaussian noise to weights, and how to then measure the description length of weights with noise added to them. Section 6 then makes the Gaussian noise flexible to the data distribution, by allowing the means/variances of the Gaussian mixture to be trained alongside the weights. Both the model and the mixture are trained to minimize the combined MDL loss. Section 7 formalizes the coding scheme.
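Here's my best guess at the mechanics of Sections 5 - 7, sketched in PyTorch rather than the paper's notation: keep a small mixture whose means, variances, and mixing proportions are trainable, and charge the weights their negative log-likelihood under that mixture as the description-length term.

```python
import torch

K = 3  # number of Gaussians in the mixture (arbitrary choice)
means      = torch.zeros(K, requires_grad=True)
log_stds   = torch.zeros(K, requires_grad=True)
mix_logits = torch.zeros(K, requires_grad=True)

def mixture_description_cost(w):
    """Negative log-likelihood of the weights under the mixture, standing in
    for the 'bits needed to describe the weights'."""
    w = w.reshape(-1, 1)                                   # (n_weights, 1)
    log_probs = torch.distributions.Normal(means, log_stds.exp()).log_prob(w)
    log_mix = torch.log_softmax(mix_logits, dim=0)         # mixing proportions
    return -torch.logsumexp(log_probs + log_mix, dim=1).sum()
```

Optimizing the task loss plus this cost, with the mixture parameters included in the optimizer, pulls weights toward nearby mixture centers while the centers chase the clusters that form in the weights.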
For the most part I skipped sections 8 and 9, though I will say the latter had one of the worst figures I’d ever seen, ever.
The one thing worth mentioning here is that the default outcome, after all this work, is that all the weights get set to an equal negative value — effectively, it kills the network. So they had to implement some additional hacks on top to make the thing work (more on this later). When they finally do get it all to work, the results are somewhat underwhelming.
Section 10 (the Discussion) briefly covers other ways to set weights in a model, and describes some minutiae about future directions for this particular regularization scheme.
Insights
This part felt key to me:
If we fix its mean and variance in advance we could pick inappropriate values that make it very expensive to code the actual weights.
...
During the optimization, the means, variances and mixing proportions in the mixture adapt to model the clusters in the weight values. Simultaneously, the weights adapt to the current mixture model so weights get pulled towards the centers of nearby clusters. Suppose, for example, that there are two Gaussians in the mixture. If one gaussian has mean 1 and low variance and the other gaussian has mean 0 and low variance it is very cheap to encode low-variance weights with values near 1 or 0.
I think Hinton started with the question “how can I apply noise to these model weights?” Choosing a fixed Gaussian noise for each weight is a reasonable starting point, but it probably didn’t work too well, because the Gaussian defaults (mean 0, variance 1) may not adequately reflect what the model needs to represent the underlying data distribution. The trained weights do end up representing the underlying data, but he wants to add the noise before the weights are fully trained. He doesn’t have the final weights ahead of time, so he can’t easily set the Gaussian parameters a priori. Well, if you can calculate derivatives for the mean and variance of a Gaussian, you can train them with backprop. It probably isn’t perfect, but it may get you close enough. And thus, the rest of this paper.
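A hypothetical, single-Gaussian illustration of that last step: because the mean and (log) variance are just tensors with gradients attached, one backward pass updates them alongside the weights.

```python
import torch

w       = torch.randn(100, requires_grad=True)   # "model weights"
mean    = torch.zeros(1, requires_grad=True)     # trainable Gaussian mean
log_std = torch.zeros(1, requires_grad=True)     # trainable Gaussian log-std

opt = torch.optim.SGD([w, mean, log_std], lr=0.01)

data_loss   = (w.sum() - 1.0) ** 2               # stand-in for the real task loss
coding_cost = -torch.distributions.Normal(mean, log_std.exp()).log_prob(w).sum()

(data_loss + coding_cost).backward()
opt.step()                                       # the mean and variance move too
```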
I think a lot of the MDL theory is potentially useful backdrop, but I really don’t get why all the information coding needed to be front and center to actually understand the regularization method.
As an aside, training the noise distribution while training the weights seems like it leads to some really complex dynamics. Imagine you ran k-means on the weights during each training step, and dynamically assigned a loss to each weight based on how ‘far’ it is from its nearest k-means centroid. As the paper mentioned, this would cause the weights to cluster together towards those centroids. But since it’s a dynamic system, it would also cause those centroids to move around at each step.
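The toy version I have in mind, purely as a thought experiment (nothing like this appears in the paper):

```python
import numpy as np

def centroid_penalty(weights, k=3, iters=10, seed=0):
    """Cluster the current weight values with 1-D k-means, then penalize each
    weight's squared distance to its nearest centroid. Re-running this every
    training step means the centroids drift as the weights move, and vice versa."""
    w = np.asarray(weights).reshape(-1)
    rng = np.random.default_rng(seed)
    centroids = rng.choice(w, size=k, replace=False)
    for _ in range(iters):
        assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = w[assign == j].mean()
    return ((w - centroids[assign]) ** 2).sum()
```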
Commentary
OK so like what the fuck? Why was this paper included in Ilya’s list? You have 30 papers, why this one? I mean, it’s a Hinton paper and it’s old, so it definitely has some clout. But in the modern era we do regularization with things like batch norm, layer norm, and dropout, as well as even older things like L1/L2 norms (the old-timey vocab for this was ‘weight decay’, which this paper makes explicit reference to). We never train Gaussian models to add noise to weights, with tables of derivatives to push through backprop; it’s just way too complicated. And in the modern era, we also care a lot less about overfitting anyway, because we just have so much data. (Training on the entire internet kind of precludes the problem of your model memorizing everything.) The specific implementation details of this paper have clearly not stood the test of time. So again, why this paper?
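For contrast, the modern toolkit I'm talking about is basically a few one-liners (a generic PyTorch sketch, not any particular codebase):

```python
import torch
import torch.nn as nn

# Modern regularization: a normalization layer and dropout in the model,
# L2 weight decay as a single flag on the optimizer.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.LayerNorm(256),
    nn.ReLU(),
    nn.Dropout(p=0.1),
    nn.Linear(256, 10),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```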
A few things stood out to me.
First, I liked the way the paper described overfitting and the need for regularization. It was incredibly concise, even as it effectively conveyed a very non-intuitive concept. I also like the concept of using the MDL as a measure of model complexity — it fits neatly into some intuitions about how to select models that I previously couldn’t really explain.
Second, I was struck by how much this paper is a relic of the past. In 2024, the only way we train deep neural nets is with backprop. But in ‘93, this wasn’t obvious. Section 10 opens with a discussion about the ‘optimal way to set weights’ (without training) and talks about how Monte Carlo sims may be better than backprop because they ‘do not impose unrealistically simple assumptions about the shape of the posterior’ — that is, they don’t assume the loss curve is an N-dimensional bowl. No one thinks about these things anymore! But maybe they should at least know about it.
Third, in the middle of Section 9, the paper says:
If we penalize the weights by the full cost of describing them, the optimization quickly makes all of the weights equal and negative and uses the bias of the output unit to fix the output at the mean of the desired values in the training set. It seems that it is initially very easy to reduce the combined cost function by using weights that contain almost no information, and it is very hard to escape from this poor solution.
That is to say, “we did all this theoretical work to justify why this should work, and it didn’t, like, at all.” It’s easy to forget in the era of foundation models, where things mostly just work, but so much of early deep learning is like this! It’s just hacks all the way down.
Fourth, this could be the first time KL Divergence appears in a deep learning context (disclaimer: I have no idea if that’s actually true). KL Divergence is pretty important in most ML theory — many loss functions are approximations of the KL Divergence between two distributions.
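To spell out what I mean (this is the standard identity, nothing specific to the paper): cross-entropy decomposes into the entropy of the true distribution plus the KL divergence to the model’s distribution, so minimizing cross-entropy against fixed targets is the same as minimizing that KL term.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (made up for illustration)
q = np.array([0.5, 0.3, 0.2])   # model's predicted distribution (made up)

entropy       = -(p * np.log(p)).sum()
cross_entropy = -(p * np.log(q)).sum()
kl            =  (p * np.log(p / q)).sum()

# H(p, q) = H(p) + KL(p || q), so minimizing cross-entropy w.r.t. q minimizes KL.
assert np.isclose(cross_entropy, entropy + kl)
```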
I doubt that this is why Ilya sent this paper to Carmack, but it stuck out to me.