I run it on 2x RTX 3090.
This is part of my llama-server presets file:
```ini
[Qwen3.5-27B-bartowski]
load-on-startup = true
alias = Qwen3.5-27B-bartowski
hf = bartowski/Qwen_Qwen3.5-27B-GGUF:Q8_0
hfd = bartowski/Qwen_Qwen3.5-2B-GGUF:Q8_0
draft-min = 1
draft-max = 4
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
ctx-size = 196608
parallel = 1
fit = true
```

This is my llama-server start command:
```shell
/home/ai/3rdparty/llama.cpp/build/bin/llama-server \
  --models-preset /home/ai/llama-server-presets.ini \
  --webui-mcp-proxy \
  --models-max 1
```

When I run it like this, llama-server works as usual, but I see no log lines indicating that speculative decoding is active, and I see no speedup.
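For comparison, here is a minimal sketch of launching the same model pair with explicit CLI flags instead of the presets file, to rule out the presets parsing as the culprit. This assumes the `-hf`/`-hfd` (HF repo for target/draft model) and `--draft-min`/`--draft-max` flags of llama-server; paths and model names are taken from my config above:

```shell
# Hypothetical direct launch, bypassing llama-server-presets.ini:
# if speculative decoding works here but not via the preset, the
# problem is in how the preset keys are picked up.
/home/ai/3rdparty/llama.cpp/build/bin/llama-server \
  -hf  bartowski/Qwen_Qwen3.5-27B-GGUF:Q8_0 \
  -hfd bartowski/Qwen_Qwen3.5-2B-GGUF:Q8_0 \
  --draft-min 1 \
  --draft-max 4 \
  --ctx-size 196608
```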
Yes, I tried `hfd = bartowski/Qwen_Qwen3.5-0.8B-GGUF:Q8_0` as well.
UPD.:
```
Apr 13 14:46:19 builder llama-server[4153398]: [49161] srv  load_model: initializing slots, n_slots = 1
Apr 13 14:46:19 builder llama-server[4153398]: [49161] common_speculative_is_compat: the target context does not support partial sequence removal
Apr 13 14:46:19 builder llama-server[4153398]: [49161] srv  load_model: speculative decoding not supported by this context
```