If you are running a MoE model that doesn't fully fit on your GPU, some of the experts have to stay in CPU memory. Put the experts you'll actually need on the GPU and you get GPU inference speeds; guess entirely wrong and you're stuck at CPU inference speeds.
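To see why even a mostly-right guess matters, here's a back-of-the-envelope model (all numbers made up): per-token latencies average linearly, so a modest miss rate already drags throughput a long way toward CPU speed.

```python
def effective_tokens_per_sec(gpu_hit_rate: float,
                             gpu_tps: float = 60.0,   # hypothetical GPU-only speed
                             cpu_tps: float = 6.0) -> float:
    """Average throughput when a fraction `gpu_hit_rate` of expert
    activations hit GPU-resident experts and the rest fall back to CPU.
    Latencies (1/tps) combine linearly, not throughputs."""
    avg_latency = gpu_hit_rate * (1.0 / gpu_tps) + (1.0 - gpu_hit_rate) * (1.0 / cpu_tps)
    return 1.0 / avg_latency

for rate in (1.0, 0.9, 0.5, 0.0):
    print(f"hit rate {rate:.0%}: ~{effective_tokens_per_sec(rate):.1f} tok/s")
```

With these made-up numbers, a 90% hit rate gets you ~30 tok/s, not ~54, so the placement really does have to be good, not just decent.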
Guessing well is probably easy -- the experts you used most often in the past are probably the ones you'll need again. But does llama-server actually use a heuristic like this?
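Something like this is what I have in mind -- a minimal sketch of the profiling heuristic, assuming you could observe the router's choices during a warm-up run. None of these names (`record_step`, `vram_expert_budget`, etc.) are llama-server's actual API; they're just for illustration.

```python
from collections import Counter

# Count how often each expert id gets activated during a warm-up
# pass over representative prompts.
activation_counts = Counter()

def record_step(router_topk_ids):
    """Call once per token/layer with the expert ids the router chose."""
    activation_counts.update(router_topk_ids)

def choose_gpu_experts(vram_expert_budget: int):
    """Return the ids of the most frequently used experts, up to however
    many fit in VRAM; everything else stays in CPU memory."""
    return [eid for eid, _ in activation_counts.most_common(vram_expert_budget)]

# After warm-up:
# gpu_set = choose_gpu_experts(vram_expert_budget=32)
# ...then place those experts' weights on the GPU, rest on the CPU.
```

As far as I can tell, expert placement in llama.cpp is controlled by manual tensor overrides rather than any usage profiling like this, but I'd be happy to be corrected.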