Positional Encoding via Token-Aware Phase Attention
arXiv:2509.12635v3 Announce Type: replace-cross
Abstract: We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long contexts. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameter retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long ranges, extends to longer contexts with direct and light continual pretraining, extrapolates to unseen lengths, and attains substantially lower perplexity and stronger retrieval performance in the long-context regime than RoPE-style baselines.
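To make the idea concrete, below is a minimal single-head sketch of attention with a learnable, token-dependent phase, written under stated assumptions: the module name `TokenAwarePhaseAttention`, the additive phase-bias form `cos(phase_q - phase_k)`, and the parameterization `phase(x, p) = p * softplus(freq) + shift(x)` are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenAwarePhaseAttention(nn.Module):
    """Sketch: attention scores modulated by a learnable, token-aware phase
    instead of a fixed rotary angle. Assumed form, not the paper's exact one."""

    def __init__(self, d_model: int, n_freq: int = 16):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learnable per-frequency rates (position-dependent part of the phase).
        self.freq = nn.Parameter(torch.randn(n_freq) * 0.02)
        # Token-aware phase shift computed from the token representation.
        self.phase_shift = nn.Linear(d_model, n_freq)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Phase for each token: position term plus a content-dependent shift.
        pos = torch.arange(t, device=x.device, dtype=x.dtype)          # (t,)
        base = pos[:, None] * F.softplus(self.freq)                    # (t, n_freq)
        phase = base + self.phase_shift(x)                             # (b, t, n_freq)

        # Pairwise phase differences between query and key tokens.
        dphi = phase[:, :, None, :] - phase[:, None, :, :]             # (b, t, t, n_freq)
        phase_bias = torch.cos(dphi).mean(dim=-1)                      # (b, t, t)

        # Standard dot-product scores plus the phase bias, causally masked.
        scores = (q @ k.transpose(-2, -1)) * self.scale + phase_bias
        causal = torch.triu(torch.ones(t, t, device=x.device, dtype=torch.bool), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        attn = scores.softmax(dim=-1)
        return attn @ v


# Usage example (shapes only):
# layer = TokenAwarePhaseAttention(d_model=64)
# out = layer(torch.randn(2, 128, 64))   # -> (2, 128, 64)
```

Because the phase depends on token content rather than only on absolute position, the relative-phase bias need not decay with distance the way a fixed rotary angle does, which is the intuition behind the long-range claims in the abstract.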