cs.AI, cs.LG

CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference

arXiv:2604.08584v1 Announce Type: new
Abstract: Long-context LLMs increasingly rely on long, reusable prefill prompts for agents and domain Q&A, making attention computation and the KV cache the dominant decode-time bottlenecks. While sparse attention …
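The abstract is truncated, so the paper's exact mechanism is not shown here. As an illustration of the general idea a "centroid-scoring" sparse attention might take, the sketch below clusters keys into fixed-size blocks, scores the query against each block's centroid, and runs exact attention only over the top-scoring blocks. Block size, top-k, and the dot-product scoring rule are assumptions, not the paper's method.

```python
import numpy as np

def centroid_sparse_attention(q, K, V, block_size=4, top_k=2):
    """Attend only to KV blocks whose key centroid scores highest against q.

    Illustrative sketch; not the paper's algorithm.
    """
    n, d = K.shape
    n_blocks = n // block_size
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    centroids = Kb.mean(axis=1)            # one centroid per KV block
    scores = centroids @ q                 # cheap block-level relevance scores
    keep = np.argsort(scores)[-top_k:]     # indices of the top-k blocks
    idx = (keep[:, None] * block_size + np.arange(block_size)).ravel()
    logits = (K[idx] @ q) / np.sqrt(d)     # exact attention over kept keys only
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[idx]

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
out = centroid_sparse_attention(q, K, V)
print(out.shape)  # (8,)
```

The appeal of this family of methods is that block scoring costs O(n/block_size) dot products per query instead of O(n), so most of the KV cache is never read at decode time.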