Tight Clusters Make Specialized Experts
arXiv:2502.15315v3 Announce Type: replace
Abstract: Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the u…
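The abstract is cut off mid-sentence, but for context on the router it refers to, below is a minimal, illustrative sketch of a standard top-k MoE router (softmax over the k largest gate logits per token). This is not the paper's proposed method, and every name here (TopKRouter, gate, d_model, num_experts, k) is a hypothetical placeholder.

    # Illustrative sketch only: a generic top-k MoE router in PyTorch.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKRouter(nn.Module):
        """Linear router that assigns each token to its top-k experts."""
        def __init__(self, d_model: int, num_experts: int, k: int = 2):
            super().__init__()
            self.k = k
            # One gate logit per expert for every token.
            self.gate = nn.Linear(d_model, num_experts, bias=False)

        def forward(self, x: torch.Tensor):
            # x: (num_tokens, d_model)
            logits = self.gate(x)                                # (num_tokens, num_experts)
            topk_logits, topk_idx = logits.topk(self.k, dim=-1)  # keep the k best experts per token
            weights = F.softmax(topk_logits, dim=-1)             # renormalize over the chosen k
            return weights, topk_idx

    # Usage: route 4 tokens of width 8 among 4 experts, 2 experts each.
    router = TopKRouter(d_model=8, num_experts=4, k=2)
    w, idx = router(torch.randn(4, 8))
    print(w.shape, idx.shape)  # torch.Size([4, 2]) torch.Size([4, 2])

Because only the selected k experts run per token, compute per token stays roughly constant as the expert count grows, which is the capacity/cost decoupling the abstract describes.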