cs.AI, cs.LG

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

arXiv:2605.05219v1 Announce Type: cross
Abstract: Prefix caching is a key latency optimization for autoregressive LLM serving, yet existing systems assume dense per-token key/value reuse. State-space models change the structure of the problem: a recur…
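The contrast the abstract draws can be made concrete with a toy sketch. In a Transformer, every token of a shared prefix leaves reusable KV blocks, so a cache can match any prefix length; in a recurrent/state-space model, the whole prefix is compressed into one fixed-size state, so reuse is only possible at positions where a state snapshot was explicitly saved. The sketch below illustrates that granularity gap; all names, the cache layout, and the checkpoint interval are illustrative assumptions, not the paper's API.

```python
"""Toy contrast: dense per-token KV reuse vs. sparse recurrent-state reuse.
Everything here is a hypothetical sketch, not the paper's implementation."""
from dataclasses import dataclass, field


@dataclass
class DenseKVCache:
    """Transformer-style cache: each token prefix maps to reusable KV blocks,
    so any shared prefix length can be matched and reused."""
    blocks: dict = field(default_factory=dict)  # prefix tuple -> KV blocks

    def insert(self, tokens: list, kv) -> None:
        self.blocks[tuple(tokens)] = kv

    def longest_reusable_prefix(self, tokens: list) -> int:
        # Per-token granularity: walk back until some cached prefix matches.
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.blocks:
                return n
        return 0


@dataclass
class RecurrentStateCache:
    """SSM-style cache: one fixed-size state summarizes the entire prefix,
    so reuse only succeeds at exact positions where a state was snapshotted."""
    checkpoints: dict = field(default_factory=dict)  # prefix tuple -> state
    interval: int = 64  # hypothetical checkpoint spacing

    def maybe_checkpoint(self, tokens: list, state) -> None:
        # Sparse policy: only save a state every `interval` tokens.
        if len(tokens) % self.interval == 0:
            self.checkpoints[tuple(tokens)] = state

    def longest_reusable_prefix(self, tokens: list) -> int:
        # Only exact checkpointed prefixes can be resumed from.
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.checkpoints:
                return n
        return 0
```

Under this sketch, the dense cache can resume from any previously seen prefix length, while the recurrent cache must recompute from the nearest saved checkpoint, which is the structural change the abstract attributes to state-space models.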