LocalLLaMA

FastDMS: 6.4x KV-cache compression running faster than vLLM BF16/FP8

Last year, researchers affiliated with NVIDIA, the University of Warsaw, and the University of Edinburgh published Dynamic Memory Sparsification (DMS), a KV-cache sparsification technique that uses learned per-head token eviction, reporting up to 8x KV-cache compre…