Hi. Besides the fact that an xByA MoE models runs as fast as a yA models but produces better results, what are other benefits of pursuing an MoE architecture and not a dense one with e.g. x/2 (or x/3) parameters?
Given that we need enough RAM for xB parameter anyway, aren't MoEs at a disadvantage when RAM is scarce, like the current situation?
And thinking of limit cases, is there a limit on x/y, so that it doesn't make sense e.g. to train a 100B1A MoE model?
Thanks.
[link] [comments]