cs.AI, cs.LG, math.ST, stat.ML, stat.TH

Generalization and Scaling Laws for Mixture-of-Experts Transformers

arXiv:2604.09175v1 Announce Type: new
Abstract: We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \emph{active} per-input capacity from routing combinatorics. By conditioning on fixed ro…
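The distinction the abstract draws between active per-input capacity and total capacity can be illustrated with a minimal top-k MoE routing sketch. This is a generic illustration, not the paper's construction: the layer shapes, the gating function, and the parameter counts below are all assumptions made for the example.

```python
import numpy as np

def topk_moe_layer(x, expert_weights, gate_weights, k=2):
    """Route input x to its top-k experts (softmax-weighted mixture).

    x: (d,) input; expert_weights: (E, d, d); gate_weights: (E, d).
    Only k of the E experts are evaluated per input, so the *active*
    per-input expert parameter count is k/E of the total.
    (Illustrative sketch only; not the paper's definitions.)
    """
    logits = gate_weights @ x                    # (E,) gating scores
    top = np.argsort(logits)[-k:]                # indices of the top-k experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                         # softmax over the selected experts
    # Mix only the selected experts' outputs.
    return sum(p * (expert_weights[i] @ x) for p, i in zip(probs, top))

E, d, k = 8, 4, 2
rng = np.random.default_rng(0)
out = topk_moe_layer(rng.normal(size=d),
                     rng.normal(size=(E, d, d)),
                     rng.normal(size=(E, d)), k=k)

total_expert_params = E * d * d
active_expert_params = k * d * d  # parameters actually touched per input
```

Here the model holds `E * d * d` expert parameters in total, but any single input activates only `k * d * d` of them (a k/E fraction), which is the "active capacity" the abstract conditions on.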