cs.AI, cs.CL, cs.LG

KVSculpt: KV Cache Compression as Distillation

arXiv:2603.27819v1
Abstract: KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint — quantization and low-rank decomposition — are orthogonal to those that reduce …