Expert Upcycling: Growing MoE capacity mid-training without increasing inference cost (7B→13B, ~32% GPU hours saved)

Author here, sharing a preprint we recently released. We're actively looking for feedback from this community before we revise.

Motivation. Training large MoEs from scratch is expensive. All expert weights, gradients, and optimizer states must reside in accelerator memory regardless of how few are active per token, and all-to-all communication can consume 45–50% of step time on standard GPU clusters. Both costs scale with total expert count, which is in tension with scaling laws that recommend lower activation ratios (more experts at fixed active parameters) for better quality-per-FLOP.

Method. We introduce expert upcycling: given a trained E-expert MoE, we expand to mE experts mid-training by duplicating existing experts and extending the router with small bias noise on replicas. Top-K routing is held fixed, so per-token FLOPs and inference cost are unchanged. Continued pre-training then breaks the symmetry among duplicated experts, driving specialization. The key enabler is loss-free load balancing, which guarantees every replica receives gradient signal and prevents routing collapse.

Results. On a 7B→13B interleaved MoE (32→64 experts, Top-2, architecture similar to Llama 4):

Validation loss: 1.263 (upcycled) vs. 1.267 (fixed-64 from scratch)
Average accuracy across 11 downstream benchmarks: 56.4 vs. 56.7
GPU hours: ~32% reduction vs. training the 64-expert model from scratch
~67% reduction in the sunk-cost setting where the 32-expert checkpoint already exists

We also validate on a full MoE with 256 experts and Top-8 routing (matching DeepSeek-V3, Kimi K2, and GLM-4.5 configurations), showing the approach generalizes beyond the interleaved architecture.

Paper: https://huggingface.co/papers/2604.19835

Code and training configurations: github.com/amazon-science/expert-upcycling

Happy to discuss the method, ablations (including a practical recipe for transition timing and duplication strategy), the theoretical framing, or training setup in detail — and genuinely interested in pushback on limitations and failure modes we may not have stress-tested.

https://preview.redd.it/7hzzkopus0xg1.png?width=1084&format=png&auto=webp&s=62481a3e621a221ca4ad45c45abd6db018a25244

submitted by /u/Pigs-On-Wing
[link] [comments]

Leave a Comment