Maybe I was too harsh on deep learning theory (three days ago)

A few days ago, I reviewed a paper titled “There Will Be a Scientific Theory of Deep Learning”. In it, I expressed appreciation to the authors for writing the piece, but skepticism about the stronger forms of its titular claim.

Since then I’ve spoken with various past collaborators (via text and in person), and read or reread quite a few deep learning theory papers, including the bombshell Zhang et al. 2016 and Nagarajan et al. 2019 papers that I wrote about on LessWrong.

And the thing is, parts of the infinite-width and infinite-depth limit work turned out to be much more interesting than I had thought. Perhaps I have judged deep learning theory (a bit) too harshly.


A lot of my impression of the infinite-width and depth-limit work comes from the neural tangent kernel (NTK) / neural network Gaussian Process (NNGP) line of work. This line starts from Radford Neal’s 1994 paper, where he noted that an infinitely wide single-hidden-layer neural network with random weights is a Gaussian Process. In 2017/2018, this was extended to deep neural networks: Lee et al. showed that a randomly initialized deep neural network, under a certain type of infinite-width limit, is also a Gaussian Process. This was then extended by Jacot et al. in the Neural Tangent Kernel work, which described the training dynamics of these infinitely wide networks and showed that they are equivalent to kernel gradient descent with a fixed kernel (the eponymous Neural Tangent Kernel). This allowed people to derive convergence properties and nontrivial generalization bounds.
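To make the object concrete: at finite width, the empirical NTK is just the Gram matrix of parameter gradients. Here is a minimal sketch in JAX (my code, not from any of the papers above; `init_params`, `mlp`, and `empirical_ntk` are names I made up):

```python
import jax
import jax.numpy as jnp

def init_params(key, widths=(2, 512, 512, 1)):
    keys = jax.random.split(key, len(widths) - 1)
    # O(1) Gaussian weights; the 1/sqrt(fan_in) factors live in the forward pass
    return [jax.random.normal(k, (w_in, w_out))
            for k, w_in, w_out in zip(keys, widths[:-1], widths[1:])]

def mlp(params, x):
    h = x
    for W in params[:-1]:
        h = jnp.tanh(h @ W / jnp.sqrt(W.shape[0]))
    return (h @ params[-1] / jnp.sqrt(params[-1].shape[0])).squeeze(-1)

def empirical_ntk(params, x1, x2):
    # Theta(x1, x2) = J(x1) J(x2)^T, with J the Jacobian of outputs w.r.t. all parameters
    j1 = jax.jacobian(mlp)(params, x1)
    j2 = jax.jacobian(mlp)(params, x2)
    dot = lambda a, b: jnp.tensordot(a, b, axes=(list(range(1, a.ndim)),
                                                 list(range(1, b.ndim))))
    return sum(jax.tree_util.tree_map(dot, j1, j2))

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (4, 2))
theta = empirical_ntk(init_params(key), x, x)  # a 4x4 PSD kernel matrix
```

The infinite-width theorems say this matrix concentrates around a deterministic kernel and, remarkably, stays fixed over the entire course of training.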

Unfortunately, while beautiful, this is definitely not how neural networks learn. In the NTK limit, the network behaves as if it were doing linear regression in a fixed feature space whose dimension equals the number of parameters. Notably, there is no feature learning, and only the last-layer weights are updated by a noticeable amount. Unsurprisingly, this does not describe the behavior of real neural networks; small (finite-width) neural networks have been shown to outperform their corresponding tangent kernels.
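The “glorified linear regression” reading can be made literal: the NTK-limit model is the first-order Taylor expansion of the network around its initialization. A sketch, reusing `mlp` and `init_params` from above (again my own hypothetical helper, not anyone’s published code):

```python
import jax
import jax.numpy as jnp

def linearize_at_init(f, params0):
    """First-order Taylor expansion of f around params0."""
    def f_lin(params, x):
        # f_lin(theta, x) = f(theta0, x) + <df/dtheta|_{theta0}, theta - theta0>
        delta = jax.tree_util.tree_map(jnp.subtract, params, params0)
        y0, dy = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
        return y0 + dy
    return f_lin
```

Gradient descent on `f_lin` is exactly kernel gradient descent with the empirical NTK above; the NTK theorems say the full network tracks this linearization as the width goes to infinity.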

An alternative way of taking an infinite-width limit is Mean Field Theory (MFT, applied to deep neural networks). As I understand it, the basic idea behind Mean Field Theory in physics is that, instead of calculating the interactions between many objects, you replace the many-body interactions with an average “field” that captures the overall dynamics of the system. (Hence the name.) In neural network land, it turns out that you can take a different infinite-width limit in which the empirical distribution of hidden-unit parameters, viewed as a probability measure on parameter space, evolves under a deterministic flow. This was worked out around 2018 by Mei, Montanari, and Nguyen; Chizat and Bach; Rotskoff and Vanden-Eijnden; and Sirignano and Spiliopoulos.
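Schematically (my notation, following the two-layer setup those papers study): write the network as an average over hidden units,

$$ f_N(x) = \frac{1}{N}\sum_{i=1}^{N} a_i\,\sigma(\langle w_i, x\rangle), \qquad \rho_N = \frac{1}{N}\sum_{i=1}^{N} \delta_{(a_i,\,w_i)}. $$

As N → ∞, the empirical measure under (stochastic) gradient descent converges to a deterministic limit evolving by a Wasserstein gradient flow on the risk,

$$ \partial_t \rho_t = \nabla_\theta \cdot \big( \rho_t \, \nabla_\theta \Psi(\theta;\, \rho_t) \big), $$

where θ = (a, w) and Ψ is (up to constants) the first variation of the risk with respect to ρ. The nonlinearity of learning gets pushed into a PDE over the distribution of neurons, rather than frozen into a fixed kernel.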

Notably, in this different infinite-width limit, networks actually learn features. NTK uses 1/√N output scaling, under which parameters move only O(1/√N) during training: too small to change the effective kernel. Mean-field uses 1/N output scaling (with a learning rate scaled up correspondingly), which lets parameters move Θ(1), so the kernel evolves and hidden representations change over the course of training. In MFT, the model is doing something other than glorified linear regression in a fixed random feature space. That being said, for a few years, MFT was entirely a theory of two-layer neural networks, and it was genuinely unclear how to extend it to deeper networks.
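This scaling difference is easy to see in a toy experiment. Below is a quick numpy sketch (mine; the hyperparameters are arbitrary) that trains the same two-layer tanh network under both scalings and measures how far the first-layer weights move relative to initialization:

```python
import numpy as np

N, d, B, steps, base_lr = 4096, 8, 64, 1000, 0.2
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(B, d))
y = np.sin(X[:, 0])  # arbitrary toy regression target

def train(output_scale, lr):
    rng = np.random.default_rng(1)        # same init for both runs
    W = rng.normal(size=(d, N))           # first layer
    a = rng.normal(size=N)                # second layer
    W0 = W.copy()
    for _ in range(steps):
        h = np.tanh(X @ W)                # [B, N] hidden features
        err = output_scale * (h @ a) - y  # [B] residuals
        # hand-derived gradients of the mean squared error
        grad_a = output_scale * h.T @ err / B
        grad_W = output_scale * X.T @ ((err[:, None] * a) * (1 - h**2)) / B
        a -= lr * grad_a
        W -= lr * grad_W
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

# NTK-style: 1/sqrt(N) output scale, O(1) learning rate -> first layer barely moves
print("NTK        :", train(N ** -0.5, base_lr))
# mean-field-style: 1/N output scale, learning rate scaled by N -> Theta(1) movement
print("mean-field :", train(1.0 / N, base_lr * N))
```

The first run’s relative movement shrinks like 1/√N as you grow the width; the second run’s stays roughly constant. That constant-sized movement is feature learning.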

As with most of the deep learning community, I was very impressed by Greg Yang’s Tensor Programs work, which is a natural extension of the two-layer MFT work. Yang proved a series of theorems that let him build a unifying framework (the abc parameterization) for deep neural networks, in which NNGP/NTK and MFT are special cases of one family. Notably, this allowed him to derive perhaps the clearest practical application of deep learning theory: μP (the maximal-update parameterization), which allows hyperparameters tuned on a small model to transfer across model width.
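In practice this is packaged up in the mup library (github.com/microsoft/mup). Here is a hedged sketch of the usage pattern based on my reading of its README (treat the exact signatures as assumptions, and check the real docs before copying this):

```python
import torch.nn as nn
from mup import MuReadout, MuAdam, set_base_shapes

class MLP(nn.Module):
    def __init__(self, width=128, d_in=32, d_out=10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(d_in, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # MuReadout replaces the final nn.Linear so the output layer
        # gets muP's 1/width scaling
        self.head = MuReadout(width, d_out)

    def forward(self, x):
        return self.head(self.body(x))

# set_base_shapes compares a base model to a delta model to infer which
# dimensions are "width", then rescales inits and per-layer learning rates
model = MLP(width=4096)
set_base_shapes(model, MLP(width=64), delta=MLP(width=128))
opt = MuAdam(model.parameters(), lr=1e-3)  # lr tuned at width 64 should transfer
```

The payoff: you sweep learning rates on the width-64 model, then train the width-4096 model with the same settings.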


I sometimes run into bright young people with plenty of interest in math but not so much in engineering, who ask me what they should study. Beyond the very basics of deep learning (e.g. optimizers, basic RL theory), I used to give a shrug and say “Maybe computation in superposition? Maybe Singular Learning Theory?”. From now on, I think I’ll probably start my answer with “Mean Field Theory”.

Also, uh, Greg Yang is currently on leave from xAI for health reasons. Maybe someone should check whether he’s getting better, and try to work with him?


