Share your speculative settings for llama.cpp and Gemma4

I have totally missed the boat on speculative decoding.

Today, while generating some code for the frontend again, I found myself staring at some quite monotonous JavaScript. I decided to give llama.cpp's speculative decoding settings a go and was pleasantly surprised to see a 15-30% speedup in generation for this exact use case. The code was an arcade game on canvas (lots of simple for loops and if statements for boundary checks and simple game logic, so a lot of repetition).

The settings I ended up using on llama-server were these:

--spec-type ngram-mod --spec-ngram-size-n 18 --draft-min 6 --draft-max 48
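For anyone unfamiliar with the n-gram variant: instead of a separate draft model, the drafter looks for the most recent earlier occurrence of the last N tokens in the context and proposes the tokens that followed it. Below is a minimal sketch of that idea in Python; it is not llama.cpp's actual implementation, and the parameter names simply mirror the flags above for illustration.

```python
# Hedged sketch of n-gram draft proposal (NOT llama.cpp's real code).
# Parameters mirror --spec-ngram-size-n, --draft-min, --draft-max.

def propose_ngram_draft(context, ngram_size=18, draft_min=6, draft_max=48):
    """Search `context` (a list of token ids) for an earlier occurrence of
    its final `ngram_size` tokens; if found, propose the tokens that
    followed that occurrence as a draft for the target model to verify."""
    if len(context) <= ngram_size:
        return []
    key = context[-ngram_size:]
    # Scan backwards so the most recent repetition wins.
    for start in range(len(context) - ngram_size - 1, -1, -1):
        if context[start:start + ngram_size] == key:
            draft = context[start + ngram_size:start + ngram_size + draft_max]
            # Only propose drafts that meet the minimum length.
            return draft if len(draft) >= draft_min else []
    return []
```

The target model then verifies the drafted tokens in one batched forward pass and accepts the longest correct prefix. Repetitive code (loops, boundary checks) makes long drafts likely to be accepted, which is exactly why this use case benefits.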

The model I used was Gemma4 26B A4B (unsloth quant). On an "add a feature of 60s comic-style text effects like bang or pow text highlights with fading them out to alpha channel" task, applied to a brick-breaker game (just for the fun of it, I tortured the LLM into implementing it with SVG graphics instead of canvas), I got the following output, which I reckon is actually decent matching:

draft acceptance rate = 0.76429 ( 2727 accepted / 3568 generated)

statistics ngram_mod: #calls(b,g,a) = 2 7342 80, #gen drafts = 84, #acc drafts = 80, #gen tokens = 3880, #acc tokens = 2768, dur(b,g,a) = 1.765, 23.972, 2.707 ms
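As a quick sanity check, the headline rate matches accepted / generated from the first log line:

```python
# Reproduce the reported "draft acceptance rate" from the log above.
accepted, generated = 2727, 3568
rate = accepted / generated
print(f"draft acceptance rate = {rate:.5f}")  # prints 0.76429
```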

slot release: id 3 | task 4678 | stop processing: n_tokens = 23670, truncated = 0

Now a question for fellow coders here: what kind of settings do you use on your Gemma4 or Qwen3.5 setups, if you make use of them at all? I am running low on VRAM here, hence I don't use a draft model.

submitted by /u/hurdurdur7
