MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations
arXiv:2602.01219v4 Announce Type: replace
Abstract: The attention operator in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically instantiated from input tokens and whose width equals sequence length N. As the cont…
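Below is a minimal sketch of the fast-weight view stated in the abstract: for a single query, softmax attention is exactly a two-layer MLP whose first-layer weights are the keys, whose second-layer weights are the values, and whose hidden width is the sequence length N. Since the abstract is truncated, the paper's actual MiTA selection rule is not available here; the `topk_fast_weight_mlp` function is only an illustrative guess at what "a mixture of top-k activations" could mean (keep the k largest pre-softmax activations and renormalize over them, mixture-of-experts style). All function names and the sqrt(d) scaling are assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def attention_as_fast_weight_mlp(q, K, V):
    """Standard single-query softmax attention, written as a two-layer MLP.
    q: (d,)   query
    K: (N, d) first-layer fast weights (one row per token)
    V: (N, d_v) second-layer fast weights
    The hidden width of this MLP is the sequence length N."""
    logits = K @ q / K.shape[-1] ** 0.5   # layer 1: N hidden pre-activations
    weights = F.softmax(logits, dim=-1)   # softmax nonlinearity over width N
    return V.T @ weights                  # layer 2: mix value rows by activation

def topk_fast_weight_mlp(q, K, V, k):
    """Hypothetical top-k variant (an assumption, not the paper's algorithm):
    keep only the k largest hidden activations and renormalize the softmax
    over the surviving entries, so only k of the N value rows are read."""
    logits = K @ q / K.shape[-1] ** 0.5
    vals, idx = torch.topk(logits, k)     # select k of N hidden units
    weights = F.softmax(vals, dim=-1)     # softmax restricted to the top-k
    return V[idx].T @ weights             # second layer touches k rows only

q = torch.randn(64)
K, V = torch.randn(1024, 64), torch.randn(1024, 64)
full = attention_as_fast_weight_mlp(q, K, V)
approx = topk_fast_weight_mlp(q, K, V, k=32)
print(full.shape, approx.shape, (full - approx).norm())
```

Under this (assumed) selection rule, each query reads only k of the N value rows, so the second layer's per-query cost drops from O(N·d_v) to O(k·d_v), provided the top-k activations can be found efficiently; the abstract does not say how MiTA achieves that.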