How do I get the superfast DFlash / MTP tokens per second that I’m seeing on here? Dual 3090s

I'm trying to get these high tokens per second that I'm seeing on here using the new speculative decoding techniques.

Hardware: 2x3090, AMD 9900X, 32GB RAM, Gigabyte B850 AI TOP. Running Ubuntu 24.04, CUDA 13.0, NVIDIA-SMI 580.105.08


I'm running a specific forked driver version so that I can get the 3090s to communicate via P2P:

nvidia-smi topo -p2p r

GPU0 GPU1 GPU0 X OK GPU1 OK X 

Legend:

X = Self OK = Status Ok CNS = Chipset not supported GNS = GPU not supported TNS = Topology not supported NS = Not supported U = Unknown


For DFlash:

I followed this readme: https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-qwen36-dflash.md

I built beellama (with the 3090 params set) and downloaded the recommended spiritbuun draft files and unsloth q5_k_s. Getting around 40t/s.

For MTP:

I built the most recent llama.cpp and tried the MTP versions of Unsloth Qwen3.6 UD-Q4_K_XL and UD-Q8_K_XL. Getting 50ish t/s.

As far as I remember, I was getting 40 t/s on basic Qwen3.5-27B, so where's the 2-3x speed generation.


Here's an example of some of my commands:

from llama.cpp:

build/bin/llama-server \ -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q8_K_XL.gguf" \ -ngl 99 -c 32000 -fa on -np 1 \ --spec-type draft-mtp --spec-draft-n-max 6 --host 0.0.0.0 \ --port 8082

from llama.cpp:

build/bin/llama-server \ -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf" \ -ngl 99 -c 245600 -fa on -np 1 \ --spec-type draft-mtp --spec-draft-n-max 6 --host 0.0.0.0 \ --cache-type-k q8_0 --cache-type-v q8_0 \ --flash-attn on \ --cache-ram 0 \ --jinja \ --no-mmap \ --reasoning off \ --port 8082

from beellama:

build/bin/llama-server \ -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-Q5_K_S.gguf" \ --spec-draft-model "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/dflash-draft-3.6-q4_k_m.gguf" \ --spec-type dflash \ --spec-dflash-cross-ctx 2048 \ --host 0.0.0.0 \ --port 8082 \ -np 1 \ --kv-unified \ -ngl all \ --spec-draft-ngl all \ -b 2048 -ub 512 \ --ctx-size 245600 \ --cache-type-k turbo4 --cache-type-v turbo3_tcq \ --flash-attn on \ --cache-ram 0 \ --jinja \ --no-mmap --mlock \ --no-host --metrics \ --log-timestamps --log-prefix --log-colors off \ --reasoning on \ --chat-template-kwargs '{"preserve_thinking":true}' \ --temp 0.6 --top-k 20 --min-p 0.0

submitted by /u/runcertain
[link] [comments]

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top