Krause Synchronization Transformers
arXiv:2602.11534v3 Announce Type: replace
Abstract: Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces …
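To make the "globally normalized" point concrete, here is a minimal NumPy sketch of standard single-head scaled dot-product self-attention, not the paper's Krause synchronization mechanism; the function and weight names (`self_attention`, `Wq`, `Wk`, `Wv`) are illustrative. Each row of the attention matrix is a softmax over all tokens, so it sums to one and any gain in one token's weight comes at the expense of the others.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: shift by the max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    X: (n_tokens, d_model) token representations.
    Returns attended values of shape (n_tokens, d_model).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) pairwise affinities
    A = softmax(scores, axis=-1)      # each row is normalized over ALL tokens
    return A @ V

# Toy usage: rows of A are probability distributions over every token,
# so tokens compete for influence at every layer.
rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```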