Spectral Condition for $\mu$P under Width-Depth Scaling
arXiv:2603.00541v2 Announce Type: replace-cross
Abstract: Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($\mu$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $\mu$P under joint width-depth scaling. For deep residual networks whose residual blocks contain $k$ transformations, the framework specifies how the norms of weights and their per-step updates should scale with width and depth. It reveals a fundamental transition from $k=1$ to $k\geq 2$, unifying previously disparate $\mu$P formulations and identifying the $k\geq 2$ case as more appropriate for practical architectures with multi-transformation branches such as Transformers. Building on this framework, we derive a general recipe for implementing $\mu$P across a broad class of optimizers by mapping spectral constraints to concrete HP parameterizations, recovering existing results and extending them to additional optimizers. Finally, experiments on GPT-2 style language models show that the $\mu$P formulation derived from the $k\geq 2$ case achieves stable feature learning and robust HP transfer under width-depth scaling, whereas standard parameterization and $\mu$P in the $k=1$ case often fail to do so. These results support the practical effectiveness of the proposed spectral framework.
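The abstract does not spell out the scaling rules themselves, but the width-only spectral condition underlying $\mu$P is well established: weight spectral norms and per-step update spectral norms should both scale like $\sqrt{n_{\text{out}}/n_{\text{in}}}$. As a rough illustration of how such constraints map to concrete hyperparameters, the sketch below shows a spectral-norm-targeted initialization and a spectrally renormalized update step in PyTorch. The `depth_scale` argument is a hypothetical placeholder for the depth- and $k$-dependent factor the paper derives for residual branches; that factor is not specified in the abstract, so its value here is an assumption for illustration only.

```python
import math
import torch


def spectral_init_(weight: torch.Tensor, depth_scale: float = 1.0) -> None:
    """Gaussian init rescaled so ||W||_2 is roughly depth_scale * sqrt(fan_out / fan_in).

    A Gaussian matrix with entrywise std sigma has spectral norm of about
    sigma * (sqrt(fan_in) + sqrt(fan_out)), which fixes sigma given the target.
    `depth_scale` is a hypothetical hook for the residual-branch depth factor
    (not given in the abstract).
    """
    fan_out, fan_in = weight.shape
    target = depth_scale * math.sqrt(fan_out / fan_in)
    sigma = target / (math.sqrt(fan_in) + math.sqrt(fan_out))
    with torch.no_grad():
        weight.normal_(0.0, sigma)


def spectral_step_(weight: torch.Tensor, raw_update: torch.Tensor,
                   lr: float, depth_scale: float = 1.0) -> None:
    """Apply an update whose spectral norm is lr * depth_scale * sqrt(fan_out / fan_in).

    The raw update (e.g. a gradient or an Adam direction) is renormalized so the
    realized step satisfies the spectral constraint regardless of the base optimizer.
    """
    fan_out, fan_in = weight.shape
    target = lr * depth_scale * math.sqrt(fan_out / fan_in)
    spec = torch.linalg.matrix_norm(raw_update, ord=2)  # spectral norm via SVD
    with torch.no_grad():
        weight -= target * raw_update / (spec + 1e-12)


# Minimal usage on a single hidden-layer weight matrix.
if __name__ == "__main__":
    W = torch.empty(1024, 256)
    spectral_init_(W)
    grad = torch.randn_like(W)          # stand-in for a backprop gradient
    spectral_step_(W, grad, lr=0.01)
    print(torch.linalg.matrix_norm(W, ord=2))
```

Under this convention the base learning rate `lr` is width-independent, which is what makes hyperparameter transfer across widths possible; the paper's contribution, per the abstract, is to extend the analogous invariance to joint width-depth scaling and to the $k\geq 2$ multi-transformation residual blocks used in Transformers.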