Hierarchical Mixture-of-Experts with Two-Stage Optimization
arXiv:2605.08292v1 Announce Type: cross
Abstract: Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expe…
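The token-to-expert routing the abstract describes can be illustrated with a minimal top-k gating sketch. This is not the paper's hierarchical router or its two-stage optimization; it is a generic sparse-MoE routing step under the common assumption that each token's gate weights come from a softmax over its k highest router logits.

```python
import numpy as np

def top_k_route(logits, k=2):
    """Illustrative sparse routing: send each token to its k top experts.

    logits: (num_tokens, num_experts) router scores.
    Returns expert indices (num_tokens, k), highest-scoring first,
    and gate weights renormalized over the selected experts.
    """
    # Indices of the k largest logits per token, descending.
    idx = np.argsort(logits, axis=-1)[:, -k:][:, ::-1]
    top = np.take_along_axis(logits, idx, axis=-1)
    # Softmax over only the selected logits, so gates sum to 1 per token.
    w = np.exp(top - top.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return idx, w

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))  # 4 tokens, 8 experts (toy sizes)
idx, w = top_k_route(logits, k=2)
```

A load-balancing loss of the kind the abstract alludes to would typically be added on top of these routing decisions, penalizing uneven expert usage across the batch.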