Accelerating Prefilling via Decoding-time Contribution Sparsity
arXiv:2507.21526v4 Announce Type: replace
Abstract: Large Language Models (LLMs) incur attention cost that grows quadratically with input length, creating a major time bottleneck in the prefilling stage. Existing acceleration methods largely exploit attention …
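
The truncated abstract stops before describing the method, but the idea named in the title, restricting prefill attention to the prompt tokens that actually matter at decoding time, can be illustrated generically. The NumPy sketch below is a hypothetical illustration, not the paper's algorithm: the function name sparse_prefill_attention, the keep_ratio parameter, and the random scores array are all assumptions standing in for whatever contribution estimator the paper actually uses. It shows only the core cost argument: attending over a top-scoring subset of m keys and values turns the O(n^2) prefill attention into O(n*m).

import numpy as np

def sparse_prefill_attention(q, k, v, scores, keep_ratio=0.25):
    """Attend only over the top keep_ratio fraction of prompt tokens,
    ranked by a per-token contribution score (hypothetical estimator).

    q, k, v : (n, d) arrays for n prompt tokens with head dim d
    scores  : (n,) contribution score per token (higher = more important)
    """
    n, d = k.shape
    m = max(1, int(n * keep_ratio))
    idx = np.argsort(scores)[-m:]        # indices of the m highest-scoring tokens
    k_s, v_s = k[idx], v[idx]            # sparse key/value subset
    # Standard scaled dot-product attention, but cost is O(n*m), not O(n^2).
    logits = q @ k_s.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v_s

# Toy usage with random tensors; the scores here are random placeholders
# for a real decoding-time contribution estimator.
rng = np.random.default_rng(0)
n, d = 1024, 64
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
scores = rng.random(n)
out = sparse_prefill_attention(q, k, v, scores)
print(out.shape)  # (1024, 64)

Under these assumptions, keep_ratio directly trades accuracy for speed: with keep_ratio=0.25, the attention matrix shrinks to a quarter of its full width, and the remaining cost is linear in the retained subset size.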