cs.DC, cs.LG

PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving

arXiv:2604.12171v1 Announce Type: cross
Abstract: Pipeline parallelism (PP) is widely used to partition layers of large language models (LLMs) across GPUs, enabling scalable inference for large models. However, existing systems rely on static PP confi…