DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
arXiv:2604.24647v1 Announce Type: new
Abstract: Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive…
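The abstract is truncated, so the paper's actual method is not visible here. As a purely illustrative sketch of the general idea named in the title (layer-dependent KV cache pruning), the snippet below keeps a different number of cached key/value entries per transformer layer, ranking tokens by a per-token importance score; the function name, scoring input, and per-layer budgets are all hypothetical, not DepthKV's algorithm.

```python
import numpy as np

def prune_kv_cache(keys, values, scores, budgets):
    """Hypothetical layer-dependent KV cache pruning.

    keys, values: lists (one per layer) of arrays shaped (seq_len, d).
    scores:       list of arrays shaped (seq_len,), per-token importance
                  (e.g., accumulated attention weight -- an assumption here).
    budgets:      list of ints, how many tokens each layer retains.
    """
    pruned = []
    for k, v, s, b in zip(keys, values, scores, budgets):
        keep = np.argsort(s)[-b:]   # indices of the b highest-scoring tokens
        keep.sort()                 # preserve original token order
        pruned.append((k[keep], v[keep]))
    return pruned

# Toy example: 2 layers, 8 cached tokens, head dimension 4.
rng = np.random.default_rng(0)
keys    = [rng.normal(size=(8, 4)) for _ in range(2)]
values  = [rng.normal(size=(8, 4)) for _ in range(2)]
scores  = [rng.random(8) for _ in range(2)]
budgets = [6, 3]  # assumption: deeper layers tolerate smaller caches

out = prune_kv_cache(keys, values, scores, budgets)
print([kv[0].shape for kv in out])  # prints [(6, 4), (3, 4)]
```

Whether budgets should shrink or grow with depth, and how token importance is scored, are exactly the layer-dependent design questions the paper appears to address.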