cs.LG

Fast MoE Inference via Predictive Prefetching and Expert Replication

arXiv:2605.11537v1 Announce Type: new
Abstract: The Mixture of Experts (MoE) architecture has become a fundamental building block in state-of-the-art large language models (LLMs), improving domain-specific expertise and scaling model capacity …
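
The abstract is truncated in the feed, but the MoE layer the title builds on is standard: a learned router sends each token to a small top-k subset of experts, so only the selected experts' weights are needed per token. That sparsity is what makes predictive prefetching (fetching an expert's weights before it is selected) and replication of hot experts worthwhile at inference time. Below is a minimal, hypothetical sketch of such a top-k MoE layer in PyTorch; the class name, dimensions, and expert shapes are illustrative and not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse MoE layer: a router picks k of n experts per token.
    Sizes are illustrative, not the paper's configuration."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        # Router scores; keep only the top-k experts for each token.
        logits = self.router(x)
        weights, idx = logits.topk(self.k, dim=-1)   # (tokens, k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                        # tokens routed to expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # unselected expert: its weights need not be resident
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

moe = TopKMoE()
y = moe(torch.randn(16, 64))  # 16 tokens, each routed to 2 of 8 experts

In a memory-constrained serving setup, `idx` is only known after the router runs; predicting it ahead of time would let a runtime load the needed expert weights early, and replicating frequently selected experts would spread their load across devices. How the paper realizes these ideas is in the full abstract and text.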