cs.LG

When and Why Grouping Attention Heads Accelerates Muon Optimization

arXiv:2605.08933v1 Announce Type: new
Abstract: Muon orthogonalizes matrix updates, but multi-head attention naturally operates at the level of heads. This granularity mismatch raises the question of whether Muon should be applied to the full attention…
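Since the abstract is truncated, the following is only a minimal sketch of the granularity choice it describes: applying Muon-style Newton–Schulz orthogonalization either to a full attention projection matrix or separately to each head's slice. The coefficients follow the public Muon reference implementation; the `orthogonalize_update` helper, its parameters, and the head-slicing layout are illustrative assumptions, not the paper's method.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G with the quintic Newton-Schulz
    iteration used by Muon (coefficients from the reference code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.float32)
    transposed = X.size(0) > X.size(1)
    if transposed:                   # iterate on the wide orientation
        X = X.T
    X = X / (X.norm() + 1e-7)        # bound the spectral norm by 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return (X.T if transposed else X).to(G.dtype)

def orthogonalize_update(grad: torch.Tensor, num_heads: int,
                         per_head: bool) -> torch.Tensor:
    """Hypothetical helper contrasting the two granularities for an
    attention projection of shape (num_heads * head_dim, d_model)."""
    if not per_head:
        # treat the whole projection as a single matrix
        return newton_schulz(grad)
    # group by head: orthogonalize each (head_dim, d_model) slice
    head_dim = grad.size(0) // num_heads
    slices = grad.view(num_heads, head_dim, grad.size(1))
    return torch.stack([newton_schulz(s) for s in slices]).view_as(grad)
```

Per-head grouping runs the iteration on smaller matrices, which is one plausible source of the acceleration the title refers to; whether and why it also helps optimization quality is the question the paper poses.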