When and Why Grouping Attention Heads Accelerates Muon Optimization
arXiv:2605.08933v1 Announce Type: new
Abstract: Muon orthogonalizes matrix updates, but multi-head attention naturally operates at the level of heads. This granularity mismatch raises the question of whether Muon should be applied to the full attentio…
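The granularity choice the abstract describes can be sketched numerically. Below is a minimal, hypothetical illustration (not the paper's implementation): Muon itself approximates orthogonalization with Newton-Schulz iterations, but here an exact SVD-based polar factor stands in for it, applied either to the whole projection update or to per-head blocks. The head split along the output dimension and all function names are assumptions for illustration.

```python
import numpy as np

def orthogonalize(update):
    # Replace the update with its nearest semi-orthogonal matrix
    # (the polar factor, computed exactly via SVD here; Muon
    # approximates this with Newton-Schulz iterations instead).
    u, _, vt = np.linalg.svd(update, full_matrices=False)
    return u @ vt

def orthogonalize_per_head(update, num_heads):
    # Hypothetical head-level granularity: split the projection
    # update into per-head blocks along the output dimension and
    # orthogonalize each block independently.
    blocks = np.split(update, num_heads, axis=0)
    return np.concatenate([orthogonalize(b) for b in blocks], axis=0)

rng = np.random.default_rng(0)
g = rng.standard_normal((8, 8))  # toy gradient for a d_model=8 projection

full = orthogonalize(g)               # whole-matrix granularity
heads = orthogonalize_per_head(g, 2)  # head granularity: 2 heads of dim 4
```

The two granularities generally produce different updates: the full-matrix version yields one orthogonal factor, while the per-head version makes each head's block semi-orthogonal on its own.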