EMO: Frustratingly Easy Progressive Training of Extendable MoE
arXiv:2605.13247v2 Announce Type: replace
Abstract: Sparse Mixture-of-Experts (MoE) models offer a powerful way to scale model size without a matching increase in compute, since per-token FLOPs depend only on the k active experts rather than on the total pool of E experts….
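To make the k-of-E compute argument concrete, here is a minimal top-k routing sketch in PyTorch. It is a generic illustration, not the paper's EMO method: the class name TopKMoE, the expert MLP shape, and the hyperparameters are all assumptions, and the per-expert loop is written for clarity rather than efficiency. Each token is sent through only its k selected experts, so the per-token cost scales with k while E can grow freely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative top-k MoE layer (not the paper's implementation).
    Each token is routed to k of E experts, so per-token FLOPs scale with k, not E."""
    def __init__(self, d_model: int, num_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); scores: (tokens, E)
        scores = self.router(x)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)  # normalize over the k selected experts

        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                # Only tokens routed to expert e in this slot pay for its forward pass.
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage sketch: 8 experts, 2 active per token.
layer = TopKMoE(d_model=64, num_experts=8, k=2)
tokens = torch.randn(16, 64)
y = layer(tokens)  # (16, 64); each token touched only 2 of the 8 experts
```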