ExFusion: Efficient Transformer Training via Multi-Experts Fusion
arXiv:2603.27965v1 Announce Type: new
Abstract: Mixture-of-Experts (MoE) models substantially improve performance over dense architectures by increasing model capacity. However, directly training MoE models requires considerable computational resources an…
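For context on the MoE setting the abstract refers to, below is a minimal sketch of a standard top-k gated MoE layer. This is a generic illustration of the architecture class, not the paper's ExFusion method; the class name `MoELayer` and parameters `num_experts` and `top_k` are assumptions chosen for the example.

```python
# Generic top-k gated Mixture-of-Experts layer (illustrative sketch,
# not the ExFusion method described in the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # router over experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Each token is routed to its top-k experts,
        # and the expert outputs are combined with softmax-normalized gate weights.
        logits = self.gate(x)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

# usage
layer = MoELayer(d_model=64, d_ff=256)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```

Because every expert holds its own feed-forward weights, total parameter count (and hence training cost) grows with `num_experts` even though each token activates only `top_k` experts, which is the resource problem the abstract motivates.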