FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
arXiv:2511.00868v2 Announce Type: replace
Abstract: Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that atte…
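The scaling claim in the abstract is simple arithmetic: the KV cache stores one key and one value vector per layer, per KV head, per token, so its footprint grows linearly with the number of tokens held (prompt plus generated). The sketch below estimates that footprint; the function name and the example model configuration (a Llama-2-7B-like setup with 32 layers, 32 KV heads, head dimension 128, fp16) are illustrative assumptions, not details from the paper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2, batch: int = 1) -> int:
    """Estimate KV cache size in bytes for one model configuration.

    The factor of 2 accounts for storing both the key and the value
    tensor at every layer. seq_len is the total number of cached
    tokens: context (prompt) length plus generation length.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes * batch


if __name__ == "__main__":
    # Assumed Llama-2-7B-like config: 32 layers, 32 KV heads, head_dim 128.
    # At 4096 cached tokens in fp16 this is already 2 GiB per sequence,
    # which is why long contexts make the KV cache the serving bottleneck.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=32,
                          head_dim=128, seq_len=4096)
    print(f"{size / 2**30:.1f} GiB")  # → 2.0 GiB
```

Doubling the cached token count doubles this figure, so memory (and the bandwidth to read the cache each decoding step) grows with both context length and generation length, as the abstract notes.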