cs.AI, cs.CL, cs.LG

Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference

arXiv:2601.07667v2 Announce Type: replace-cross
Abstract: With the growing prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received considerable attention. Among the numerous works that have been proposed in recent…
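To make the idea concrete, the following is a minimal, illustrative sketch of layer-wise token pruning for a KV cache: each layer keeps only the cached tokens that have received the most attention, with deeper layers pruning more aggressively. This is not the paper's method; the function names, the cumulative-attention scoring, and the `keep_ratio` schedule are all illustrative assumptions.

```python
import numpy as np

def prune_kv_cache(keys, values, attn_scores, keep_ratio):
    """Keep the cached tokens with the highest attention for one layer.

    keys, values : (seq_len, head_dim) arrays, one layer's KV cache.
    attn_scores  : (seq_len,) cumulative attention each cached token received
                   (an illustrative importance score, not the paper's).
    keep_ratio   : fraction of tokens to retain (hypothetical parameter).
    """
    seq_len = keys.shape[0]
    k = max(1, int(seq_len * keep_ratio))
    # Top-k tokens by score, kept in their original sequence order.
    keep = np.sort(np.argsort(attn_scores)[-k:])
    return keys[keep], values[keep]

def layer_keep_ratio(layer_idx, num_layers, first=1.0, last=0.25):
    """Illustrative layer-wise schedule: interpolate linearly from keeping
    everything at the first layer to `last` at the final layer."""
    return first + (last - first) * layer_idx / max(1, num_layers - 1)
```

For example, with an 8-token cache and `keep_ratio=0.5`, the 4 tokens with the highest scores survive, halving that layer's KV memory while preserving token order for positional consistency.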