cs.DC, cs.LG

TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

arXiv:2505.11329v5 Announce Type: replace-cross
Abstract: Distributed inference of large language models (LLMs) using tensor parallelism can introduce communication overheads of 20% even over GPUs connected via NVLink, a high-speed GPU interconnect….