cs.AI, cs.LG

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

arXiv:2605.08317v1 Announce Type: cross
Abstract: Large language models (LLMs) have shown strong performance across diverse tasks, but long-context inference is bottlenecked by memory capacity and bandwidth. The Key-Value (KV) cache size…
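To make the memory bottleneck concrete, a back-of-the-envelope estimate of the KV cache footprint helps. This is an illustrative sketch only: the function, its name, and the example configuration (a 7B-class decoder with 32 layers, 32 heads, head dimension 128, fp16 storage) are assumptions for illustration, not details from the paper.

```python
def kv_cache_bytes(num_layers: int, num_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size for a decoder-only transformer (hypothetical helper).

    Each layer stores two tensors (K and V), each of shape
    [batch, num_heads, seq_len, head_dim], at bytes_per_elem per value.
    """
    return 2 * num_layers * num_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed 7B-class config at fp16 with a 32k-token context:
size_gib = kv_cache_bytes(32, 32, 128, seq_len=32_768, batch=1) / 2**30
print(f"{size_gib:.1f} GiB")  # → 16.0 GiB
```

At this scale the cache alone rivals the model weights, which is why eviction (dropping entries) and quantization (shrinking `bytes_per_elem`) are the two levers a joint bit-allocation scheme like the one in the title would trade off.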