Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions
arXiv:2604.00499v1 Announce Type: new
Abstract: To schedule LLM inference, the shortest job first (SJF) principle is favorable because it prioritizes requests with short output lengths and thereby avoids head-of-line (HOL) blocking. Existing methods usually p…
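
To illustrate why SJF ordering reduces head-of-line blocking, here is a minimal sketch rather than the paper's method. It assumes a single server that handles one request at a time with no batching, service time proportional to output length, all requests already queued, and output lengths known in advance. The function name `mean_wait` and the example lengths are hypothetical.

```python
from itertools import accumulate


def mean_wait(lengths):
    """Mean time a request waits before it starts, serving in the given order."""
    # The start time of request i is the total length of the requests served before it.
    starts = [0] + list(accumulate(lengths))[:-1]
    return sum(starts) / len(lengths)


# Hypothetical output lengths (in tokens), listed in arrival order.
# The long request arrives first and blocks the three short ones.
arrivals = [1000, 10, 20, 30]

fcfs = mean_wait(arrivals)          # first come, first served
sjf = mean_wait(sorted(arrivals))   # shortest job first (lengths assumed known)

print(f"FCFS mean wait: {fcfs}")    # 760.0
print(f"SJF mean wait:  {sjf}")     # 25.0
```

Under these assumptions, serving the long request last cuts the mean wait from 760 to 25 token-times. In practice output lengths are not known ahead of time and must be predicted.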