Author here, sharing a preprint we recently released. We're actively looking for feedback from this community before we revise.

Motivation. Training large MoEs from scratch is expensive. All expert weights, gradients, and optimizer states must reside in accelerator memory regardless of how few experts are active per token, and all-to-all communication can consume 45–50% of step time on standard GPU clusters. Both costs scale with total expert count, which is in tension with scaling laws that recommend lower activation ratios (more experts at fixed active parameters) for better quality-per-FLOP.

Method. We introduce expert upcycling: given a trained E-expert MoE, we expand to mE experts mid-training by duplicating existing experts and extending the router with small bias noise on the replicas. Top-K routing is held fixed, so per-token FLOPs and inference cost are unchanged. Continued pre-training then breaks the symmetry among duplicated experts, driving specialization. The key enabler is loss-free load balancing, which guarantees every replica receives gradient signal and prevents routing collapse.

Results. On a 7B→13B interleaved MoE (32→64 experts, Top-2, architecture similar to Llama 4):
We also validate on a full MoE with 256 experts and Top-8 routing (matching DeepSeek-V3, Kimi K2, and GLM-4.5 configurations), showing the approach generalizes beyond the interleaved architecture.

Paper: https://huggingface.co/papers/2604.19835

Code and training configurations: github.com/amazon-science/expert-upcycling

Happy to discuss the method, ablations (including a practical recipe for transition timing and duplication strategy), the theoretical framing, or training setup in detail, and genuinely interested in pushback on limitations and failure modes we may not have stress-tested.
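For concreteness, here is a minimal NumPy sketch of the two mechanisms described above: expert duplication with small router bias noise on replicas, and a loss-free load-balancing bias update. All names, shapes, and the exact update rule are illustrative assumptions on my part, not the implementation from the paper or repo.

```python
import numpy as np

def upcycle_experts(expert_weights, router_weights, router_bias, m,
                    noise_std=1e-3, seed=0):
    """Expand an E-expert MoE layer to m*E experts by duplication (sketch).

    Assumed shapes: expert_weights is a list of E parameter arrays,
    router_weights is (E, d), router_bias is (E,). Replicas copy their
    parent expert exactly; only each replica's router bias gets a small
    Gaussian perturbation, so Top-K routing is nearly unchanged at the
    transition and continued training breaks the symmetry.
    """
    rng = np.random.default_rng(seed)
    new_experts, new_rows, new_bias = [], [], []
    for e in range(len(expert_weights)):
        for r in range(m):
            new_experts.append(expert_weights[e].copy())
            new_rows.append(router_weights[e].copy())
            b = float(router_bias[e])
            if r > 0:                      # perturb replicas only
                b += rng.normal(0.0, noise_std)
            new_bias.append(b)
    return new_experts, np.stack(new_rows), np.array(new_bias)

def update_balance_bias(balance_bias, expert_load, gamma=1e-3):
    """One step of an illustrative loss-free load-balancing rule.

    balance_bias would be added to router logits only when selecting the
    Top-K experts; after each step, nudge it against the observed load so
    overloaded experts become slightly less likely to be picked, with no
    auxiliary loss term in the objective.
    """
    return balance_bias - gamma * np.sign(expert_load - expert_load.mean())
```

The design choice being sketched: because replicas start as exact copies and only their routing bias is perturbed, the expanded model's forward pass is almost identical to the original at the transition point, so the expansion does not spike the loss; the balancing bias then keeps every replica receiving tokens (and gradient) while training differentiates them.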