I run it on 2x RTX 3090.
This is part of my llama-server presets file:
```ini
[Qwen3.5-27B-bartowski]
load-on-startup = true
alias = Qwen3.5-27B-bartowski
hf = bartowski/Qwen_Qwen3.5-27B-GGUF:Q8_0
hfd = bartowski/Qwen_Qwen3.5-2B-GGUF:Q8_0
draft-min = 1
draft-max = 4
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
ctx-size = 196608
parallel = 1
fit = true
```

This is my llama-server start command:
```shell
/home/ai/3rdparty/llama.cpp/build/bin/llama-server \
  --models-preset /home/ai/llama-server-presets.ini \
  --webui-mcp-proxy \
  --models-max 1
```

When I run it like this, llama-server works as usual, but I see no log lines indicating that speculative decoding is active, and I see no speedup.
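For comparison, here is a minimal sketch of launching the same model pair with explicit CLI flags instead of the presets file, to rule out the presets parsing as the culprit. This assumes the `-hf`/`-hfd` (HF repo for target/draft model) and `--draft-min`/`--draft-max` flags of llama-server; paths and model names are taken from my config above:

```shell
# Hypothetical direct launch, bypassing llama-server-presets.ini:
# if speculative decoding works here but not via the preset, the
# problem is in how the preset keys are picked up.
/home/ai/3rdparty/llama.cpp/build/bin/llama-server \
  -hf  bartowski/Qwen_Qwen3.5-27B-GGUF:Q8_0 \
  -hfd bartowski/Qwen_Qwen3.5-2B-GGUF:Q8_0 \
  --draft-min 1 \
  --draft-max 4 \
  --ctx-size 196608
```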
Yes, I tried `hfd = bartowski/Qwen_Qwen3.5-0.8B-GGUF:Q8_0` as well.
UPD.:
```
Apr 13 14:46:19 builder llama-server[4153398]: [49161] srv  load_model: initializing slots, n_slots = 1
Apr 13 14:46:19 builder llama-server[4153398]: [49161] common_speculative_is_compat: the target context does not support partial sequence removal
Apr 13 14:46:19 builder llama-server[4153398]: [49161] srv  load_model: speculative decoding not supported by this context
```