Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
arXiv:2604.09613v2 Announce Type: replace-cross
Abstract: Production vLLM fleets provision every instance for worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short and simultaneously triggering KV-cache failures …
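The routing idea the abstract describes can be sketched minimally: estimate each request's total token budget (prompt tokens plus the caller's generation cap) and send it to the cheapest pool whose context window fits, falling back to a worst-case pool only when necessary. The pool names, context sizes, and `Request` shape below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of token-budget-aware pool routing. The pool
# sizes and names are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int   # tokens already in the prompt
    max_new_tokens: int  # caller's generation cap

# Pools keyed by their max context length, cheapest (smallest) first.
POOLS = [
    (4_096, "short-ctx-pool"),   # serves the common short requests
    (32_768, "long-ctx-pool"),   # worst-case-provisioned fallback
]

def route(req: Request) -> str:
    """Return the cheapest pool whose context window fits the request."""
    budget = req.prompt_tokens + req.max_new_tokens
    for max_ctx, pool in POOLS:
        if budget <= max_ctx:
            return pool
    raise ValueError(f"request budget {budget} exceeds every pool")

# A short chat turn lands on the small pool; a long-document request
# falls through to the worst-case pool.
print(route(Request(500, 1_000)))      # short-ctx-pool
print(route(Request(20_000, 4_000)))   # long-ctx-pool
```

Routing on the budget (prompt plus generation cap) rather than prompt length alone is what keeps the small pool from triggering the KV-cache failures the abstract mentions: a request only lands there if its full worst-case footprint fits.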