MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
arXiv:2605.05225v2 Announce Type: replace-cross
Abstract: Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) suffer from a significant efficiency bottleneck during Expert Parallelism (EP) inference due to the straggler effect. This issue …
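
As a rough illustration of the straggler effect named in the abstract (not the paper's method, which is not described in the text available here), the sketch below models one synchronous EP step. It assumes one expert group per device and a per-device cost proportional to the number of tokens routed to it; the function names and token counts are hypothetical.

```python
def ep_step_time(tokens_per_device, time_per_token=1.0):
    """Latency of one synchronous EP step under the assumed cost model.

    Every device must finish before the step completes, so the most
    heavily loaded device (the straggler) sets the step time.
    """
    return max(tokens_per_device) * time_per_token


def straggler_slowdown(tokens_per_device):
    """Ratio of actual step time to the perfectly balanced step time."""
    mean_load = sum(tokens_per_device) / len(tokens_per_device)
    return max(tokens_per_device) / mean_load


# Hypothetical loads: 400 tokens spread over 4 devices.
balanced = [100, 100, 100, 100]
skewed = [250, 50, 50, 50]  # one expert group receives most tokens

print(ep_step_time(balanced), straggler_slowdown(balanced))
print(ep_step_time(skewed), straggler_slowdown(skewed))
```

Under this simplified model, total work is identical in both cases, yet the skewed routing takes longer because three devices sit idle while the fourth finishes.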