cs.AI, cs.DC

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving

arXiv:2604.09562v1 Announce Type: cross
Abstract: Efficient LLM serving must balance throughput and latency across diverse, bursty workloads. We introduce StreamServe, a disaggregated prefill-decode serving architecture that combines metric-aware rout…