The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization
arXiv:2603.13331v2 Announce Type: replace
Abstract: Grokking -- the sudden generalisation that appears long after a model has perfectly memorised its training data -- has been widely observed but lacks a quantitative theory explaining the length of the delay. We show that grokking is a norm-driven representational phase transition in regularised training dynamics, and establish the Norm-Separation Delay Law: $T_{\mathrm{grok}} - T_{\mathrm{mem}} = \Theta(\gamma_{\mathrm{eff}}^{-1} \log(\|\theta_{\mathrm{mem}}\|^2 / \|\theta_{\mathrm{post}}\|^2))$, where $\gamma_{\mathrm{eff}}$ is the optimiser's effective contraction rate ($\gamma_{\mathrm{eff}} = \eta\lambda$ for SGD, $\gamma_{\mathrm{eff}} \ge \eta\lambda$ for AdamW). The upper bound follows from a discrete Lyapunov contraction argument; the matching lower bound from dynamical constraints of regularised first-order optimisation. Across 293 training runs spanning modular addition, modular multiplication, and sparse parity, we confirm three falsifiable predictions: inverse scaling with weight decay ($R^2 = 0.97$), inverse scaling with learning rate ($R^2 = 0.92$), and logarithmic dependence on the norm ratio (Pearson $r = 0.91$). A fourth finding reveals that grokking requires an optimiser capable of decoupling memorisation from contraction: SGD fails entirely at the same hyperparameters where AdamW reliably groks. These results reframe grokking not as a mysterious optimisation artefact but as a predictable consequence of norm separation between competing interpolating representations. We further derive a practical three-input algorithm that predicts grokking delay at memorisation time with 34.6% mean absolute error (bootstrap 95% CI [30.0%, 39.4%], $N=60$ seeds), enabling principled early stopping.
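The delay law stated in the abstract can be turned into a simple calculator. The sketch below is an illustrative reading of the formula only, not the paper's actual three-input algorithm: the function name, the choice of the hidden $\Theta(\cdot)$ constant `c` (set to 1 here), and the use of the SGD contraction rate $\gamma_{\mathrm{eff}} = \eta\lambda$ are all assumptions for illustration.

```python
import math

def predicted_grok_delay(eta, weight_decay, norm_mem_sq, norm_post_sq, c=1.0):
    """Illustrative sketch of the Norm-Separation Delay Law:

        T_grok - T_mem = Theta( gamma_eff^{-1} * log(||theta_mem||^2 / ||theta_post||^2) )

    eta          -- learning rate
    weight_decay -- weight-decay coefficient (lambda)
    norm_mem_sq  -- squared parameter norm at memorisation, ||theta_mem||^2
    norm_post_sq -- squared parameter norm of the generalising solution, ||theta_post||^2
    c            -- unknown Theta-constant; c = 1.0 is an arbitrary placeholder
    """
    # SGD case from the abstract: gamma_eff = eta * lambda.
    # For AdamW the abstract only gives a bound, gamma_eff >= eta * lambda,
    # so this value would be a conservative (over-)estimate of the delay.
    gamma_eff = eta * weight_decay
    # Delay in optimisation steps, up to the unknown constant c.
    return (c / gamma_eff) * math.log(norm_mem_sq / norm_post_sq)

# Example: eta = 1e-3, lambda = 1.0, memorising solution has 4x the squared
# norm of the generalising one -> delay of log(4)/1e-3 ~ 1386 steps (times c).
delay = predicted_grok_delay(eta=1e-3, weight_decay=1.0,
                             norm_mem_sq=4.0, norm_post_sq=1.0)
```

Note how the abstract's first two empirical predictions fall out directly: the delay scales as $1/\lambda$ and $1/\eta$ through `gamma_eff`, and only logarithmically in the norm ratio.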