LayerNorm Induces Recency Bias in Transformer Decoders
arXiv:2509.21042v4 Announce Type: replace
Abstract: Causal self-attention provides positional information to Transformer decoders. Prior work has shown that stacks of causal self-attention layers alone induce a positional bias in attention scores towa…
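The abstract is cut off, but the mechanism it names, positional bias emerging in stacks of causal self-attention layers and its interaction with LayerNorm, can be probed with a small experiment. Below is a minimal, hypothetical sketch in PyTorch, not the paper's actual method or setup: it builds a randomly initialized stack of causal self-attention layers without any positional encodings, toggles a pre-attention LayerNorm on or off, and reports where the final query position places its attention mass. All hyperparameters (layer count, width, sequence length) are arbitrary illustrative choices.

```python
# Illustrative probe only; not the experimental setup of arXiv:2509.21042.
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Single-head causal self-attention with no positional encodings."""

    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.d_model = d_model

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / self.d_model ** 0.5
        # Causal mask: position t may only attend to positions <= t.
        t = x.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        attn = scores.softmax(dim=-1)  # (batch, seq_len, seq_len)
        return self.proj(attn @ v), attn


def final_query_attention(use_layernorm: bool, n_layers: int = 4,
                          d_model: int = 64, seq_len: int = 32,
                          batch: int = 64, seed: int = 0):
    """Top-layer attention of the last query position over all key positions."""
    torch.manual_seed(seed)
    layers = nn.ModuleList([CausalSelfAttention(d_model) for _ in range(n_layers)])
    norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
    # i.i.d. Gaussian inputs: any positional signal must come from the stack.
    x = torch.randn(batch, seq_len, d_model)
    attn = None
    with torch.no_grad():
        for layer, norm in zip(layers, norms):
            h = norm(x) if use_layernorm else x  # pre-norm toggle
            out, attn = layer(h)
            x = x + out  # residual connection
    # Use only the last query row: it sees every key, so the profile is not
    # skewed by the causal mask the way an average over all queries would be.
    return attn.mean(dim=0)[-1]  # (seq_len,)


if __name__ == "__main__":
    for flag in (False, True):
        profile = final_query_attention(use_layernorm=flag)
        print(f"LayerNorm={flag}: attention mass on first/last key = "
              f"{profile[0].item():.4f} / {profile[-1].item():.4f}")
```

Comparing the two printed profiles shows whether the randomly initialized stack allocates attention unevenly across key positions and how toggling LayerNorm shifts that allocation; whether the shift matches the recency bias the title claims is for the full paper to establish.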