A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
arXiv:2605.04595v1 Announce Type: new
Abstract: The rapid adoption of large language models (LLMs) has created significant challenges for efficient inference at scale. Unlike traditional workloads, LLM inference is constrained by both computation and …