cs.DC, cs.LG

TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

arXiv:2505.11329v5 Announce Type: replace-cross
Abstract: Distributed inference of large language models (LLMs) using tensor parallelism can introduce communication overheads of 20% even over GPUs connected via NVLink, a high-speed GPU interconnect….