Rethinking Language Model Scaling under Transferable Hypersphere Optimization
arXiv:2603.28743v2
Abstract: Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not …