Hello all, the docs for the vLLM production stack suggest autoscaling the vLLM worker instances based on the number of waiting requests, but it seems like this only helps with newly arriving requests. We're getting bursts of LLM calls that overwhelm our pods/instances. The burst does trigger a scale-up, but since nothing redistributes the requests already queued on the hot pods/instances, we end up in a situation where some pods are sitting on a large backlog of waiting requests while the newly scaled-up pods/instances do nothing. Are there any solutions for this?
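For context, this is roughly how we're observing the imbalance: a minimal sketch that scrapes each replica's Prometheus `/metrics` endpoint and prints the `vllm:num_requests_waiting` gauge per pod (the pod addresses below are hypothetical placeholders for our deployment):

```python
# Poll each replica's /metrics endpoint and print its queue depth,
# to show how the backlog stays on the hot pod after scale-up.
import urllib.request

PODS = [
    "http://vllm-pod-0:8000/metrics",  # hot pod that took the burst
    "http://vllm-pod-1:8000/metrics",  # newly scaled pod, sitting idle
]

def waiting_requests(metrics_url: str) -> float:
    """Read the vllm:num_requests_waiting gauge from a replica's /metrics."""
    with urllib.request.urlopen(metrics_url, timeout=5) as resp:
        for line in resp.read().decode().splitlines():
            # Skip HELP/TYPE comment lines; match the gauge by name prefix.
            if line.startswith("vllm:num_requests_waiting"):
                return float(line.rsplit(" ", 1)[-1])
    return 0.0

for url in PODS:
    print(url, waiting_requests(url))
```

In practice the first pod reports a large waiting count while the new pod stays near zero, which is exactly the problem described above.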