It started with claims of a 4-5x speed advantage over ordinary bf16 models (tests are less optimistic, but let's assume this is true).
Then the gain on MoE models is not that good; that figure was for dense models.
Then quantization greatly reduces the gain: Q8_0 still gains, Q4_0 not much.
Then with multiple users/streams the speed gain shrinks as the number of users grows: halved at 2, 20% at 4, 0% at 8.
Finally, all of this is for very short context, so there's another drop at longer context.
Practically, regular users (consumer PC with 8/16 GB VRAM) will see little gain, if any, due to 2-1-4,
and mini-server use will see little gain, if any, due to 2-1-3 and partially 4.
I'd say stop the optimism about it, and wait to see whether DDTree has better/more consistent results.
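
To put rough numbers on the multi-user point, here is a minimal sketch. It assumes the quoted figures (halved at 2, 20% at 4, 0% at 8) mean "share of the single-user gain that survives", and it takes 4x, the low end of the claimed range, as the starting speedup. The names `RETAINED_GAIN` and `effective_speedup` are made up for illustration, and none of this is a measurement.

```python
# Share of the single-user gain that survives at each stream count.
# Assumption: this is one reading of "halved at 2, 20% at 4, 0% at 8".
RETAINED_GAIN = {1: 1.0, 2: 0.5, 4: 0.2, 8: 0.0}


def effective_speedup(base_speedup: float, users: int) -> float:
    """Speedup over plain decoding when only part of the gain survives."""
    retained = RETAINED_GAIN[users]
    # Only the part above 1.0x counts as "gain", so scale that part alone.
    return 1.0 + (base_speedup - 1.0) * retained


if __name__ == "__main__":
    for users in sorted(RETAINED_GAIN):
        speedup = effective_speedup(4.0, users)
        print(f"{users} concurrent stream(s): {speedup:.2f}x")
```

Under that reading a claimed 4x turns into 2.5x at 2 streams and 1.6x at 4, before the MoE, quantization and long-context penalties are applied on top.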