What is the point of MoE models, beyond being faster?

Hi. Besides the fact that an xByA MoE model (x billion total parameters, y billion active per token) runs about as fast as a dense yB model but produces better results, what are the other benefits of pursuing an MoE architecture rather than a dense one with e.g. x/2 (or x/3) billion parameters?

Given that we need enough RAM for all xB parameters anyway, aren't MoEs at a disadvantage when RAM is scarce, as it is right now?

And thinking of limit cases, is there a limit on the ratio x/y beyond which it no longer makes sense to train the model, e.g. a 100B1A MoE?
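To make the trade-off I'm asking about concrete, here is a rough back-of-the-envelope sketch. It rests on assumptions that are mine and not measurements: weights stored at 0.5 bytes per parameter (roughly 4-bit quantization), per-token compute approximated as 2 FLOPs per active parameter, and KV cache and activations ignored. The `ModelSpec` class and the function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    name: str
    total_b: float   # total parameters, in billions (the "x")
    active_b: float  # parameters used per token, in billions (the "y")


def weight_memory_gb(spec: ModelSpec, bytes_per_param: float = 0.5) -> float:
    # Assumption: every expert must stay resident, so memory follows total params.
    # Billions of params times bytes per param gives GB directly.
    return spec.total_b * bytes_per_param


def gflops_per_token(spec: ModelSpec) -> float:
    # Assumption: rule of thumb of ~2 FLOPs per active parameter per token.
    return 2.0 * spec.active_b


def sparsity_ratio(spec: ModelSpec) -> float:
    # The x/y ratio from the question.
    return spec.total_b / spec.active_b


if __name__ == "__main__":
    candidates = [
        ModelSpec("100B1A MoE", total_b=100, active_b=1),
        ModelSpec("50B dense (x/2)", total_b=50, active_b=50),
        ModelSpec("1B dense (y)", total_b=1, active_b=1),
    ]
    for m in candidates:
        print(
            f"{m.name}: ~{weight_memory_gb(m):.1f} GB weights, "
            f"~{gflops_per_token(m):.1f} GFLOPs/token, x/y = {sparsity_ratio(m):.0f}"
        )
```

Under these assumptions the MoE pays the memory bill of its total size but the compute bill of its active size, which is exactly the tension when RAM is the scarce resource. The sketch says nothing about output quality, which is the part I'm unsure about.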

Thanks.

submitted by /u/ihatebeinganonymous