Focus and Dilution: The Multi-stage Learning Process of Attention
arXiv:2605.01199v1 Announce Type: new
Abstract: Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus-dil…