cs.CL, cs.LG

MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression

arXiv:2604.17695v1 Announce Type: cross
Abstract: KV cache memory is the dominant bottleneck for long-context LLM inference. Existing compression methods each act on a single axis of the four-dimensional KV tensor — token eviction (sequence), quantiz…
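The abstract gestures at a mixture-of-experts router that picks a compression strategy per layer. Since the abstract is truncated, the sketch below is purely illustrative and not the paper's method: a hypothetical softmax gate over four single-axis compression "experts", driven by made-up per-layer summary statistics. All names (`EXPERTS`, `route_layer`, the gating weights) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical compression "experts", one per axis of the 4-D KV tensor
# (sequence, bit-width, head, layer) as the abstract enumerates them.
EXPERTS = ["token_eviction", "quantization", "head_pruning", "channel_reduction"]

def route_layer(layer_stats, gate_weights):
    """Softmax gate over experts given per-layer summary statistics.

    layer_stats  : (d,) vector, e.g. attention entropy, KV norms (assumed).
    gate_weights : (num_experts, d) learned gating matrix (assumed).
    """
    logits = gate_weights @ layer_stats
    z = np.exp(logits - logits.max())          # stable softmax
    probs = z / z.sum()
    return EXPERTS[int(probs.argmax())], probs

n_layers, d = 4, 8
W = rng.normal(size=(len(EXPERTS), d))         # hypothetical gating weights
for layer in range(n_layers):
    stats = rng.normal(size=d)                 # stand-in layer statistics
    expert, probs = route_layer(stats, W)
    print(f"layer {layer}: {expert}  probs={probs.round(2)}")
```

The design point being illustrated is only that the routing decision is made independently per layer, so different layers can be compressed along different axes.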