cs.AI, cs.LG, math.OC

A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

arXiv:2605.04595v1 Announce Type: new
Abstract: The rapid adoption of large language models (LLMs) has created significant challenges for efficient inference at scale. Unlike traditional workloads, LLM inference is constrained by both computation and …
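The abstract frames LLM inference stability as a queueing problem under KV cache memory limits. As a rough, generic illustration (not the paper's actual model), the sketch below checks the classic queueing stability condition ρ = λ·E[S] < 1 alongside a naive KV-cache occupancy bound derived from Little's law; every parameter name here is an assumption for illustration only.

```python
# Hypothetical sketch, NOT the paper's framework: joint compute/memory
# stability check for an LLM serving queue. Parameter names are assumed.

def is_stable(arrival_rate: float,
              mean_service_time: float,
              mean_kv_bytes: float,
              kv_capacity_bytes: float) -> bool:
    """Return True if both compute and KV-memory utilization stay below 1.

    arrival_rate       -- requests per second (lambda)
    mean_service_time  -- expected seconds of GPU time per request, E[S]
    mean_kv_bytes      -- expected KV-cache footprint of one in-flight request
    kv_capacity_bytes  -- total KV-cache memory available
    """
    # Classic single-server stability condition: utilization rho < 1.
    rho_compute = arrival_rate * mean_service_time

    # Little's law: mean number of in-service requests = lambda * E[S].
    # Each in-flight request pins roughly mean_kv_bytes of KV cache,
    # so this is a (lower-bound) memory utilization estimate.
    mean_in_flight = arrival_rate * mean_service_time
    rho_memory = mean_in_flight * mean_kv_bytes / kv_capacity_bytes

    return rho_compute < 1.0 and rho_memory < 1.0
```

This ignores queueing delay, batching, and variable sequence lengths, all of which a real analysis along the lines of the title would have to model; it is only meant to show why memory, not just compute, can be the binding constraint.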