Muon Dynamics as a Spectral Wasserstein Flow
arXiv:2604.04891v1 Announce Type: cross
Abstract: Gradient normalization is central in deep-learning optimization because it stabilizes training and reduces sensitivity to scale. For deep architectures, parameters are naturally grouped into matrices o…