Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers
arXiv:2603.17771v2 Announce Type: replace
Abstract: Attention sinks and massive activations are recurring and closely related phenomena in Transformer models. Existing explanations have largely focused on the forward pass, yet in pre-norm Transformers…
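
To make the two phenomena named in the abstract concrete, below is a minimal diagnostic sketch (not the paper's method) for inspecting attention-sink and massive-activation behavior in a pretrained model via the Hugging Face transformers library. The choice of "gpt2" as the probe model, the probe sentence, and the sink metric (mean attention mass on key position 0) are illustrative assumptions, not details from the paper.

    # Illustrative sketch: probe attention sinks and massive activations
    # in a small pretrained model. Model and metrics are assumptions made
    # for illustration; the paper's own analysis may differ.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    inputs = tok("Attention sinks concentrate probability mass.",
                 return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True,
                    output_hidden_states=True)

    # out.attentions: one (batch, heads, query, key) tensor per layer.
    # out.hidden_states: embeddings plus one (batch, seq, hidden) tensor
    # per layer, so we skip index 0 to align layers with attentions.
    for layer, (attn, hidden) in enumerate(
            zip(out.attentions, out.hidden_states[1:])):
        # "Sink" behavior: attention mass placed on the first token,
        # averaged over heads and query positions.
        sink_mass = attn[0, :, :, 0].mean().item()
        # Massive activations appear as outlier magnitudes in the
        # hidden states of certain layers.
        max_act = hidden.abs().max().item()
        print(f"layer {layer:2d}  sink mass on token 0: {sink_mass:.3f}  "
              f"max |activation|: {max_act:8.1f}")

In practice, sketches like this show sink mass and peak activation magnitudes rising sharply in particular layers of pretrained LLMs, which is the forward-pass signature the abstract refers to.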