Speculative decoding in llama.cpp for Gemma 4 31B IT / Qwen 3.5 27B?

Has anyone here tested speculative decoding in llama.cpp with Gemma 4 31B IT or Qwen 3.5 27B?

For Gemma, I was thinking about using a smaller same-family draft model.
For Qwen 3.5, I’m not sure if it works well at all in llama.cpp.

If you’ve tried it, which draft model worked best, and did you get a real speedup?
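For context, this is roughly the invocation I had in mind (a sketch based on recent llama.cpp builds — flag names like `-md`/`--draft-max` may differ in older versions, and the GGUF filenames here are placeholders, not real releases):

```shell
# Main model plus a smaller same-family draft model for speculative decoding.
# -md / --model-draft selects the draft model; --draft-max caps how many
# tokens the draft proposes per step before the main model verifies them.
./llama-server \
  -m  gemma-4-31b-it-Q4_K_M.gguf \
  -md gemma-4-small-it-Q4_K_M.gguf \
  --draft-max 16 \
  --draft-min 1 \
  -ngl 99 -ngld 99
```

My understanding is that the draft and main model need a compatible tokenizer/vocab for this to work, which is why I was leaning toward a same-family draft for Gemma.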

submitted by /u/No_Algae1753
