Speculative decoding with Gemma-4-31B + Gemma-4-E2B enables 120 – 200 tok/s output speed for specific tasks

So for my project I was using up until now either Gemini 3 / 2.5 Flash or Flash-lite. All my use cases are not agentic, simply LLM workflows for atomic tasks like extracting references from the law, classifying, adjusting titles to nominative case and so on. All this happens in non-English (LT) language, that's one of the reasons I originally used Google models, as multilingual quality is very great for small base languages.

Each single request usually fits in 2k - 6k tokens context.

Recently I found that at least Gemini 2.5 Flash-lite started to produce horrible results, even started looping which I never experienced before, not sure if coincidence or something changed internally in Vertex API / their models.

Since I have RTX 5090, I decided to give it a try with Gemma 4 31B.

My requirements are quite simple - as good as possible at non-English languages, good at producing structured JSON responses, context up to 8K and output speed as fast as possible.

So to squeeze the best possible quality I tried to run gemma-4-31B-it-GGUF:Q6_K_L + gemma-4-E2B-it-GGUF:Q8_0 speculative decoding.

And well, what I can say at least for my initial small sample testing, I can be sure that quality is better than Gemini 2.5 Flash-lite, it is faster and runs locally. The output speeds I get are around 130 - 200 tok / s which is incredible for the quality I'm getting. Setup uses 31.5 GB of VRAM, which barelly fits into my GPU.

My point is that for lightweight LLM workflows such as data extraction and similar tasks I no longer need Vertex API.

Of course the second step is to try it at larger scale instead of just a few simple tests.

https://preview.redd.it/m9j3wzb2bjxg1.png?width=856&format=png&auto=webp&s=15e6b2db2649e4d49f5bf04b0b0f618482ae88d8

Just wanted to share for others that might have similar use cases - it is worth a try, adding my llama command:

./build/bin/llama-server \ -hf bartowski/google_gemma-4-31B-it-GGUF:Q6_K_L \ -hfd unsloth/gemma-4-E2B-it-GGUF:Q8_0 \ -ngl 99 -ngld 99 -fa 1 \ -c 8192 \ --draft-max 12 --draft-min 2 \ --parallel 1 \ --cache-type-k q8_0 --cache-type-v q8_0 \ --reasoning-budget 0 --no-mmproj \ --host 0.0.0.0 --port 8080 \ --temp 1.0 --top-p 0.95 --top-k 64

submitted by /u/Clasyc
[link] [comments]

Leave a Comment