cs.LG

FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

arXiv:2604.02715v1 Announce Type: new
Abstract: Mixture-of-Experts (MoE) models have become a dominant paradigm for scaling large language models, but their rapidly growing parameter counts introduce a fundamental inefficiency during inference: most ex…