Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
arXiv:2602.01203v2 Announce Type: replace
Abstract: Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink …
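To make the attention-sink phenomenon the abstract describes concrete, here is a minimal illustrative sketch (not from the paper) of how one might quantify how much attention mass each head places on the first token. The function name `attention_sink_mass`, the toy random logits, and the tensor shapes are assumptions for illustration only.

```python
import numpy as np

def attention_sink_mass(scores: np.ndarray) -> np.ndarray:
    """Fraction of each query's attention that lands on the first (sink) token.

    `scores` are raw pre-softmax attention logits of shape
    (num_heads, seq_len, seq_len); a causal mask is applied so position i
    attends only to positions <= i.
    """
    num_heads, seq_len, _ = scores.shape
    causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(causal_mask, -np.inf, scores)
    # Softmax over the key dimension.
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Attention mass on token 0, averaged over query positions, per head.
    return weights[:, :, 0].mean(axis=-1)

# Toy example: random logits for 4 heads over a 16-token sequence.
rng = np.random.default_rng(0)
sink_mass = attention_sink_mass(rng.normal(size=(4, 16, 16)))
print(sink_mass)  # one value per head; values near 1.0 would indicate a strong sink
```

In a real model one would read the logits from the attention layers instead of sampling them at random; heads whose mass on token 0 stays near 1.0 across inputs are the ones exhibiting the sink behavior the abstract refers to.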