Generalization and Scaling Laws for Mixture-of-Experts Transformers
arXiv:2604.09175v1 Announce Type: new
Abstract: We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \emph{active} per-input capacity from routing combinatorics. By conditioning on fixed ro…
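To make the abstract's distinction concrete, the sketch below shows a top-k routed MoE layer in plain numpy: each token is processed by only k of E experts (the "active" per-input capacity), while the router chooses among C(E, k) expert subsets per token (the routing combinatorics). This is an illustrative assumption, not the paper's construction; names such as `moe_layer`, `d_model`, and `n_experts` are hypothetical.

```python
# Minimal top-k MoE routing sketch (illustrative; not the paper's construction).
from math import comb

import numpy as np


def moe_layer(x, w_router, experts, k=2):
    """Route each token in x (n_tokens, d_model) to its top-k experts.

    experts: list of (w_in, w_out) weight pairs, one per expert FFN.
    Returns the gate-weighted sum of the chosen experts' outputs.
    """
    logits = x @ w_router                       # (n_tokens, n_experts) router scores
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, topk[t]]
        gates = np.exp(scores - scores.max())
        gates /= gates.sum()                    # softmax over the selected experts only
        for gate, e in zip(gates, topk[t]):
            w_in, w_out = experts[e]
            h = np.maximum(x[t] @ w_in, 0.0)    # ReLU expert feed-forward
            out[t] += gate * (h @ w_out)
    return out


rng = np.random.default_rng(0)
d_model, d_ff, n_experts, k = 16, 64, 8, 2
x = rng.normal(size=(4, d_model))
w_router = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
           for _ in range(n_experts)]

y = moe_layer(x, w_router, experts, k=k)
per_expert = 2 * d_model * d_ff
print("total expert params:   ", n_experts * per_expert)
print("active params per token:", k * per_expert)        # active per-input capacity
print("routing choices per token:", comb(n_experts, k))  # routing combinatorics
```

Conditioning on a fixed routing pattern, as the abstract begins to describe, amounts to holding `topk` fixed so that each token sees an ordinary dense network with only the active parameters.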