From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
arXiv:2510.08055v2
Abstract: Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-tokens (TBT) while maximizing throughput under fixe…
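Since the abstract hinges on the two latency metrics, a minimal sketch of how they are typically measured may help. This is not from the paper: `token_stream` is a hypothetical streaming interface, and the timing logic simply reflects the standard definitions of TTFT (request arrival to first output token) and TBT (gap between consecutive output tokens).

```python
import time

def measure_latencies(token_stream):
    """Measure TTFT and TBT for a single request.

    `token_stream` is any iterable yielding output tokens as they are
    generated (a hypothetical streaming interface). Returns the TTFT in
    seconds and the list of per-token TBT gaps in seconds.
    """
    start = time.perf_counter()
    ttft, prev, tbts = None, None, []
    for _ in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start       # time-to-first-token
        else:
            tbts.append(now - prev)  # time-between-tokens
        prev = now
    return ttft, tbts
```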
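The title's "from tokens to layers" suggests scheduling prefill work at layer granularity rather than in token chunks. The toy loop below illustrates that general idea only; it is not the paper's scheduler, and every name in it (`run_prefill_layer`, `run_decode_step`, `next_layer`) is invented for illustration. The point of the structure: a decode step runs every iteration, so decode tokens are never stalled behind an entire prefill pass.

```python
from collections import deque

NUM_LAYERS = 4  # toy model depth; real MoE models have many more layers

def run_prefill_layer(request, layer_idx):
    """Placeholder: run one transformer layer over the request's prompt."""
    ...

def run_decode_step(active_requests):
    """Placeholder: generate one token for every in-flight decode request."""
    ...

def layered_prefill_loop(prefill_queue, active_requests, num_steps):
    """Interleave decode with prefill at *layer* granularity: each
    iteration runs a full decode step first, then advances the pending
    prefill by exactly one layer."""
    for _ in range(num_steps):
        run_decode_step(active_requests)
        if prefill_queue:
            req = prefill_queue[0]
            run_prefill_layer(req, req["next_layer"])
            req["next_layer"] += 1
            if req["next_layer"] == NUM_LAYERS:
                # Prompt has passed through all layers; request can decode.
                active_requests.append(prefill_queue.popleft())

# Usage sketch:
# queue = deque([{"prompt": "...", "next_layer": 0}])
# layered_prefill_loop(queue, active_requests=[], num_steps=16)
```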