cs.LG

Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference

arXiv:2605.06046v1 Announce Type: new
Abstract: Auto-regressive token generation in large language models is memory-bound because it requires “attending to” key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the effici…
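As a rough illustration of the memory-bound step the abstract describes, the sketch below shows single-head attention for one decode step: producing each new token requires streaming the key/value (KV) cache of all previous tokens from memory. This is an assumed, minimal NumPy illustration, not the paper's implementation; all names and shapes are hypothetical.

```python
# Minimal sketch (assumed, not from the paper) of one auto-regressive decode
# step with a KV cache: the entire cache is read to emit a single token,
# which is why decoding is memory-bound rather than compute-bound.
import numpy as np

def decode_step(q, k_cache, v_cache, k_new, v_new):
    """Single-head attention for one newly generated token.

    q       : (d,)    query for the token being generated
    k_cache : (t, d)  keys of all previous tokens (the KV cache)
    v_cache : (t, d)  values of all previous tokens
    k_new   : (d,)    key of the current token
    v_new   : (d,)    value of the current token
    """
    # The cache grows by one entry every step.
    k_cache = np.vstack([k_cache, k_new[None, :]])
    v_cache = np.vstack([v_cache, v_new[None, :]])

    # O(t * d) bytes must be read from memory to produce O(d) bytes of output.
    scores = k_cache @ q / np.sqrt(q.shape[0])   # (t+1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over all past tokens
    out = weights @ v_cache                      # (d,)
    return out, k_cache, v_cache

if __name__ == "__main__":
    d, t = 64, 128   # head dimension, tokens generated so far (illustrative)
    rng = np.random.default_rng(0)
    k_cache = rng.standard_normal((t, d)).astype(np.float32)
    v_cache = rng.standard_normal((t, d)).astype(np.float32)
    q, k_new, v_new = (rng.standard_normal(d).astype(np.float32) for _ in range(3))
    out, k_cache, v_cache = decode_step(q, k_cache, v_cache, k_new, v_new)
    print(out.shape, k_cache.shape)              # (64,) (129, 64)
```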