I'm trying to get these high tokens per second that I'm seeing on here using the new speculative decoding techniques.
Hardware: 2x3090, AMD 9900X, 32GB RAM, Gigabyte B850 AI TOP. Running Ubuntu 24.04, CUDA 13.0, NVIDIA-SMI 580.105.08
I'm running a specific forked driver version so that I can get the 3090s to communicate via P2P:
nvidia-smi topo -p2p r
GPU0 GPU1 GPU0 X OK GPU1 OK X Legend:
X = Self OK = Status Ok CNS = Chipset not supported GNS = GPU not supported TNS = Topology not supported NS = Not supported U = Unknown
For DFlash:
I followed this readme: https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-qwen36-dflash.md
I built beellama (with the 3090 params set) and downloaded the recommended spiritbuun draft files and unsloth q5_k_s. Getting around 40t/s.
For MTP:
I built the most recent llama.cpp and tried the MTP versions of Unsloth Qwen3.6 UD-Q4_K_XL and UD-Q8_K_XL. Getting 50ish t/s.
As far as I remember, I was getting 40 t/s on basic Qwen3.5-27B, so where's the 2-3x speed generation.
Here's an example of some of my commands:
from llama.cpp:
build/bin/llama-server \ -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q8_K_XL.gguf" \ -ngl 99 -c 32000 -fa on -np 1 \ --spec-type draft-mtp --spec-draft-n-max 6 --host 0.0.0.0 \ --port 8082
from llama.cpp:
build/bin/llama-server \ -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf" \ -ngl 99 -c 245600 -fa on -np 1 \ --spec-type draft-mtp --spec-draft-n-max 6 --host 0.0.0.0 \ --cache-type-k q8_0 --cache-type-v q8_0 \ --flash-attn on \ --cache-ram 0 \ --jinja \ --no-mmap \ --reasoning off \ --port 8082
from beellama:
build/bin/llama-server \ -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-Q5_K_S.gguf" \ --spec-draft-model "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/dflash-draft-3.6-q4_k_m.gguf" \ --spec-type dflash \ --spec-dflash-cross-ctx 2048 \ --host 0.0.0.0 \ --port 8082 \ -np 1 \ --kv-unified \ -ngl all \ --spec-draft-ngl all \ -b 2048 -ub 512 \ --ctx-size 245600 \ --cache-type-k turbo4 --cache-type-v turbo3_tcq \ --flash-attn on \ --cache-ram 0 \ --jinja \ --no-mmap --mlock \ --no-host --metrics \ --log-timestamps --log-prefix --log-colors off \ --reasoning on \ --chat-template-kwargs '{"preserve_thinking":true}' \ --temp 0.6 --top-k 20 --min-p 0.0
[link] [comments]