why can't llama.cpp combine speculative decoding methods?

dicking around with the new MTP speculative decoding with qwen3.6 27b, and it's great. but for agentic coding i've seen bigger improvements from ngram drafting, because a decent fraction of the time (e.g. when calling an edit tool) the model is just repeating verbatim a section of code it has already seen. ngram can speculate a lot of tokens really fast in comparison.
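for anyone unfamiliar with why ngram drafting is so cheap: it just looks for the last time the trailing few tokens appeared earlier in the context and proposes the continuation that followed them, no draft model forward pass needed. a minimal sketch of the idea (my own toy version, not llama.cpp's actual implementation):

```python
# Toy sketch of n-gram lookup drafting (an illustration, not llama.cpp's code).
# Match the trailing n tokens against earlier context; if the same n-gram
# occurred before, propose the tokens that followed it as the draft, which
# the main model then verifies in a single batched pass.

def ngram_draft(tokens, n=3, max_draft=16):
    """Propose draft tokens by matching the trailing n-gram in history."""
    if len(tokens) < n:
        return []
    key = tuple(tokens[-n:])
    # scan backwards for an earlier occurrence of the trailing n-gram
    for i in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[i:i + n]) == key:
            # speculate that the same continuation repeats
            return tokens[i + n : i + n + max_draft]
    return []
```

this is why it shines on verbatim code repetition: the edit-tool case is exactly "the trailing n-gram appeared before, and a long continuation followed it."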

it'd be great if we could use both at the same time, but it looks like if i add both to the command-line arguments, only ngram is active.

is there any reason both can't be used simultaneously? is it a fundamental limitation, or just an implementation gap with a fix on the horizon?
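nothing about the math seems to forbid it: verification doesn't care where the draft came from. one naive way they could coexist (purely my speculation, not what the PR does) is to run both drafters each step and keep whichever proposes more tokens:

```python
# Hypothetical sketch of combining two draft sources (my assumption about
# how it *could* work, not how llama.cpp does it). Both drafters are
# callables taking the token history and returning a list of draft tokens.

def combined_draft(tokens, mtp_draft, ngram_draft):
    """Run both drafters; keep the longer proposal, ties go to MTP."""
    a = mtp_draft(tokens)
    b = ngram_draft(tokens)
    # the verification pass is roughly the same cost either way,
    # so prefer whichever speculation covers more tokens
    return a if len(a) >= len(b) else b
```

the ngram lookup is nearly free, so running it alongside MTP and taking the longer draft would get the best of both on repeated code without giving up MTP's gains on novel text. whether the current drafter plumbing allows two sources per step is the real question.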

EDIT: just looked at the PR again and PmNz8 asked the same question like two hours before i posted this. go give it an updoot! https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-43945448002

submitted by /u/Qwoctopussy