cs.LG, math.PR, stat.ML

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

arXiv:2605.12697v1 Announce Type: new
Abstract: Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting laws for how the inverse temperature should scale with the context length $n$, rangin…
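
To make "length-dependent logit rescaling" concrete, here is a minimal NumPy sketch, not the paper's method. It assumes single-query softmax attention with logits multiplied by an inverse temperature `beta`, and compares two schedules picked purely for illustration: a constant `beta = 1` and `beta = log n`. The helper names `attention_weights` and `entropy`, the Gaussian keys, and the dimension `d = 64` are all hypothetical choices.

```python
import numpy as np

def attention_weights(query, keys, beta):
    """Softmax attention weights over the keys at inverse temperature beta."""
    logits = beta * (keys @ query)
    logits = logits - logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def entropy(weights):
    """Shannon entropy in nats; a value of log(n) means uniform attention."""
    w = weights[weights > 0]
    return float(-(w * np.log(w)).sum())

rng = np.random.default_rng(0)
d = 64  # hypothetical head dimension
for n in (128, 1024, 8192):
    query = rng.standard_normal(d) / np.sqrt(d)
    keys = rng.standard_normal((n, d))
    # Illustrative schedules only; the source does not fix these choices.
    for name, beta in (("constant", 1.0), ("log n", np.log(n))):
        w = attention_weights(query, keys, beta)
        ratio = entropy(w) / np.log(n)  # 1.0 would be fully uniform
        print(f"n={n:5d}  beta={name:8s}  normalised entropy={ratio:.3f}")
```

The normalised entropy printed for each context length gives a rough sense of how spread out the attention weights are under each schedule; other schedules can be tried by changing the `beta` values in the inner loop.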