Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention
arXiv:2605.04279v1 Announce Type: new
Abstract: Transformer self-attention can be interpreted as a gradient flow on the unit sphere, in which tokens evolve under softmax interaction potentials and tend to form clusters. While prior work has establishe…
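The gradient-flow picture described in the abstract can be illustrated with a small numerical sketch. The following is a minimal simulation of the standard interaction-particle model of self-attention on the unit sphere (not the paper's exact formulation): each token drifts toward the softmax-weighted average of all tokens, projected onto the tangent space of the sphere, and the population clusters. The inverse temperature `beta`, step size `dt`, and step count are illustrative choices, not values from the paper.

```python
import numpy as np

def attention_flow(X, beta=1.0, dt=0.1, steps=500):
    """Euler simulation of spherical attention dynamics:
        dx_i/dt = P_{x_i}( sum_j softmax_j(beta <x_i, x_j>) x_j ),
    where P_x projects a vector onto the tangent space of the sphere at x.
    Parameters here are illustrative, not taken from the paper."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    for _ in range(steps):
        logits = beta * (X @ X.T)                      # pairwise inner products
        W = np.exp(logits - logits.max(axis=1, keepdims=True))
        W /= W.sum(axis=1, keepdims=True)              # row-wise softmax weights
        V = W @ X                                      # attention-weighted averages
        V -= np.sum(V * X, axis=1, keepdims=True) * X  # tangent-space projection
        X = X + dt * V
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # retract back to the sphere
    return X

rng = np.random.default_rng(0)
X0 = rng.standard_normal((16, 3))
Xf = attention_flow(X0)
# pairwise cosine similarities concentrate near 1 as the tokens cluster
print(np.min(Xf @ Xf.T))
```

With a moderate `beta`, random initial tokens collapse toward a single point on the sphere, which is the clustering behavior the abstract refers to; larger `beta` sharpens the softmax and can instead produce long-lived metastable multi-cluster states.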