Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
arXiv:2604.09613v2 Announce Type: replace-cross
Abstract: Production vLLM fleets provision every instance for worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short and simultaneously triggering KV-cache failures …
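The routing idea the abstract describes can be sketched minimally: estimate each request's total token budget (prompt tokens plus the caller's generation cap) and send it to the cheapest pool whose context window fits, falling back to a worst-case pool only when necessary. The pool names, context sizes, and `Request` shape below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of token-budget-aware pool routing. The pool
# sizes and names are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int   # tokens already in the prompt
    max_new_tokens: int  # caller's generation cap

# Pools keyed by their max context length, cheapest (smallest) first.
POOLS = [
    (4_096, "short-ctx-pool"),   # serves the common short requests
    (32_768, "long-ctx-pool"),   # worst-case-provisioned fallback
]

def route(req: Request) -> str:
    """Return the cheapest pool whose context window fits the request."""
    budget = req.prompt_tokens + req.max_new_tokens
    for max_ctx, pool in POOLS:
        if budget <= max_ctx:
            return pool
    raise ValueError(f"request budget {budget} exceeds every pool")

# A short chat turn lands on the small pool; a long-document request
# falls through to the worst-case pool.
print(route(Request(500, 1_000)))      # short-ctx-pool
print(route(Request(20_000, 4_000)))   # long-ctx-pool
```

Routing on the budget (prompt plus generation cap) rather than prompt length alone is what keeps the small pool from triggering the KV-cache failures the abstract mentions: a request only lands there if its full worst-case footprint fits.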