cs.LG

When and Why Grouping Attention Heads Accelerates Muon Optimization

arXiv:2605.08933v1 Announce Type: new
Abstract: Muon orthogonalizes matrix updates, but multi-head attention naturally operates at the level of heads. This granularity mismatch raises the question of whether Muon should be applied to the full attention…
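Since the abstract is truncated, the following is only a minimal sketch of the granularity choice it describes: applying Muon-style Newton–Schulz orthogonalization either to a full attention projection matrix or separately to each head's slice. The coefficients follow the public Muon reference implementation; the `orthogonalize_update` helper, its parameters, and the head-slicing layout are illustrative assumptions, not the paper's method.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G with the quintic Newton-Schulz
    iteration used by Muon (coefficients from the reference code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.float32)
    transposed = X.size(0) > X.size(1)
    if transposed:                   # iterate on the wide orientation
        X = X.T
    X = X / (X.norm() + 1e-7)        # bound the spectral norm by 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return (X.T if transposed else X).to(G.dtype)

def orthogonalize_update(grad: torch.Tensor, num_heads: int,
                         per_head: bool) -> torch.Tensor:
    """Hypothetical helper contrasting the two granularities for an
    attention projection of shape (num_heads * head_dim, d_model)."""
    if not per_head:
        # treat the whole projection as a single matrix
        return newton_schulz(grad)
    # group by head: orthogonalize each (head_dim, d_model) slice
    head_dim = grad.size(0) // num_heads
    slices = grad.view(num_heads, head_dim, grad.size(1))
    return torch.stack([newton_schulz(s) for s in slices]).view_as(grad)
```

Per-head grouping runs the iteration on smaller matrices, which is one plausible source of the acceleration the title refers to; whether and why it also helps optimization quality is the question the paper poses.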