Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training
arXiv:2604.01563v1 Announce Type: cross
Abstract: In LLM training, normalization layers and optimizers are typically treated as independent design choices. In a 3×2 factorial study at 1B parameters over 1000 training steps, we show that this assumption can fail: …
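The 3×2 factorial design can be sketched as an enumeration over the two factors, normalization layer and optimizer. This is a minimal illustrative sketch only: the truncated abstract does not name the specific levels, so the normalization and optimizer variants below are hypothetical placeholders.

```python
from itertools import product

# Hypothetical factor levels -- the abstract reports a 3x2 factorial
# (normalization x optimizer) but does not say which variants were used.
NORMALIZATIONS = ["layernorm", "rmsnorm", "no_norm"]  # assumed 3 levels
OPTIMIZERS = ["adamw", "sgd"]                         # assumed 2 levels

def factorial_configs(norms, opts):
    """Enumerate every normalization/optimizer pairing in the grid."""
    return [{"normalization": n, "optimizer": o} for n, o in product(norms, opts)]

configs = factorial_configs(NORMALIZATIONS, OPTIMIZERS)
print(len(configs))  # 3 * 2 = 6 training runs
```

Testing for a normalization-optimizer interaction then amounts to comparing outcomes across all six cells rather than across each factor in isolation.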