Expert Upcycling: Growing MoE capacity mid-training without increasing inference cost (7B→13B, ~32% GPU hours saved)

Author here, sharing a preprint we recently released. We're actively looking for feedback from this community before we revise.

Motivation. Training large MoEs from scratch is expensive: all expert weights, gradients, and optimizer states must be held in accelerator memory for the entire run, even though only a few experts are active per token.
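To make "expensive" concrete, here's a back-of-envelope sketch in Python. It assumes standard mixed-precision Adam bookkeeping (bf16 weights and gradients, fp32 master weights plus two fp32 Adam moments, ~16 bytes per parameter); the exact accounting depends on your training stack, and the 13B figure below is just the target capacity from the title, not a number from the paper.

```python
# Illustrative memory accounting for training an MoE from scratch.
# Assumed layout: bf16 params/grads, fp32 master weights, fp32 Adam moments.
def training_bytes_per_param(
    param_bytes: int = 2,   # bf16 weights
    grad_bytes: int = 2,    # bf16 gradients
    master_bytes: int = 4,  # fp32 master copy of weights
    adam_bytes: int = 8,    # fp32 Adam exp_avg + exp_avg_sq
) -> int:
    return param_bytes + grad_bytes + master_bytes + adam_bytes

n_params = 13e9  # hypothetical: full 13B capacity held from step 0
gib = n_params * training_bytes_per_param() / 2**30
print(f"~{gib:.0f} GiB of persistent training state, before activations")
# -> ~194 GiB: every expert pays this cost even when rarely routed to
```

The point of the sketch is that this state scales with total expert count, not with the experts actually used per token, which is what makes deferring capacity growth attractive.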