Laplacian Heads Improve Transformers by Smoothing Token Representations
arXiv:2602.09297v2 Announce Type: replace
Abstract: Transformers update token representations through multi-head attention and residual connections as $X \leftarrow X + \sum_{i} P^{(i)} X W_{V_i} W_{O_i}$, where $P^{(i)}$ is the softmax attention matrix …
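The abstract's update rule can be made concrete in a few lines of NumPy. The sketch below is illustrative only: the per-head shapes, the helper names, and the way the attention matrices $P^{(i)}$ are formed from query/key projections are standard-transformer assumptions, not details taken from the paper, and the Laplacian-head construction itself is not shown.

```python
# Minimal sketch of the residual multi-head attention update
#   X <- X + sum_i P^(i) X W_{V_i} W_{O_i}
# Shapes and the query/key construction of P^(i) are assumptions;
# the paper's Laplacian heads are not reproduced here.
import numpy as np

def softmax(scores, axis=-1):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def residual_mha_update(X, W_Q, W_K, W_V, W_O):
    """One residual multi-head attention update.

    X:             (n, d)       token representations
    W_Q, W_K, W_V: (h, d, d_k)  per-head input projections
    W_O:           (h, d_k, d)  per-head output projections
    """
    h, _, d_k = W_V.shape
    update = np.zeros_like(X)
    for i in range(h):
        Q, K = X @ W_Q[i], X @ W_K[i]         # (n, d_k) queries and keys
        P = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n) softmax attention matrix P^(i)
        update += P @ X @ W_V[i] @ W_O[i]     # P^(i) X W_{V_i} W_{O_i}
    return X + update                         # residual connection

# Usage: random weights for n=4 tokens, d=8 model dim, h=2 heads, d_k=4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(2, 8, 4)) for _ in range(3))
W_O = rng.normal(size=(2, 4, 8))
print(residual_mha_update(X, W_Q, W_K, W_V, W_O).shape)  # (4, 8)
```

Since each $P^{(i)}$ is row-stochastic, the term $P^{(i)}X$ replaces every token with a convex combination of tokens, which is the averaging (smoothing) behavior the title refers to.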