Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection
arXiv:2602.03216v2 Announce Type: replace
Abstract: The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured …
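The quadratic cost arises because every query attends to every key. A minimal sketch of the generic idea of token-level sparsification (selecting only the top-k highest-scoring keys per query) is shown below; this is an illustration of the general technique, not the paper's interleaved-selection algorithm, and all function names here are assumptions.

```python
import numpy as np

def dense_attention(q, K, V):
    """Full attention for one query: O(n) score computations per query."""
    scores = K @ q / np.sqrt(q.shape[0])      # (n,) similarity scores
    w = np.exp(scores - scores.max())
    w /= w.sum()                              # softmax over all n keys
    return w @ V

def topk_sparse_attention(q, K, V, k):
    """Token-sparse attention sketch: softmax over only the k best keys."""
    scores = K @ q / np.sqrt(q.shape[0])
    idx = np.argpartition(scores, -k)[-k:]    # indices of top-k scores
    s = scores[idx]
    w = np.exp(s - s.max())
    w /= w.sum()                              # softmax restricted to k tokens
    return w @ V[idx]
```

With k equal to the full sequence length the sparse variant reduces to dense attention; smaller k trades a small approximation error for a proportional reduction in work per query.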