I was wondering how much difference MTP (multi-token prediction) makes when unified memory is enabled, so I compared GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 alone against MTP + GGML_CUDA_ENABLE_UNIFIED_MEMORY=1.
The results are quite interesting: 49 tok/sec without MTP vs 64 tok/sec with MTP.
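For reference, the relative gain from those two numbers can be worked out with a quick shell calculation. This is only arithmetic on the two figures quoted above, not a new measurement, and the variable names are just illustrative.

```shell
# Throughput figures reported in this post (tokens per second).
baseline=49   # GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 only
with_mtp=64   # MTP + GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

# awk handles the floating-point division that plain POSIX shell arithmetic lacks.
speedup=$(awk -v a="$with_mtp" -v b="$baseline" 'BEGIN { printf "%.2f", a / b }')
gain=$(awk -v a="$with_mtp" -v b="$baseline" 'BEGIN { printf "%.1f", (a - b) / b * 100 }')

echo "speedup: ${speedup}x (+${gain}%)"
```

So on this setup MTP comes out roughly 1.3x faster, or about 31% more tokens per second.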
PC: RTX 5090 + 128 GB DDR5-5600 CL36 + Ryzen 9 9950X3D
Model: Qwen3.6-27B-Q8_0.gguf (Unsloth with MTP)
Command (MTP run):
CUDA_VISIBLE_DEVICES=0 GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 /home/marcin/llama-server \
-m /home/marcin/Pobrane/Qwen3.6-27B-Q8_0.gguf \
--threads 16 \
-c 262144 -fa on -np 1 \
--spec-type mtp --spec-draft-n-max 3 \
--webui-mcp-proxy \
--chat-template-kwargs '{"preserve_thinking": true}' \
--host 0.0.0.0 \
--port 8090 \
--jinja