MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 – llama.cpp

I was wondering what the difference in results would be between running with GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 alone versus GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 combined with MTP (multi-token prediction) speculative decoding.

The results are quite interesting: 49 tok/s without MTP vs 64 tok/s with MTP.
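For context, those two rates work out to roughly a 31% decode speedup from MTP. A one-liner to compute it (the two values are hardcoded from the measurements above):

```shell
# Relative speedup: (with MTP / without MTP - 1) * 100
awk 'BEGIN { base=49; mtp=64; printf "%.0f%% faster\n", (mtp/base - 1) * 100 }'
# → 31% faster
```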

PC: RTX 5090 + 128 GB DDR5-5600 CL36 + Ryzen 9 9950X3D

Model: Qwen3.6-27B-Q8_0.gguf (Unsloth with MTP)

Command:

CUDA_VISIBLE_DEVICES=0 GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 /home/marcin/llama-server \
  -m /home/marcin/Pobrane/Qwen3.6-27B-Q8_0.gguf \
  --threads 16 \
  -c 262144 -fa on -np 1 \
  --spec-type mtp --spec-draft-n-max 3 \
  --webui-mcp-proxy \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --host 0.0.0.0 \
  --port 8090 \
  --jinja
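If you want to reproduce the tok/s numbers, llama-server reports per-request timings in its /completion responses. A minimal sketch, assuming the server above is running on port 8090 (the prompt here is just a placeholder):

```shell
# Ask the running server for a short completion and print the decode rate
# from the "timings" object in the JSON response.
curl -s http://localhost:8090/completion \
  -d '{"prompt": "Hello", "n_predict": 64}' \
  | python3 -c 'import json,sys; print("%.1f tok/s" % json.load(sys.stdin)["timings"]["predicted_per_second"])'
```

The `predicted_per_second` field is the generation (decode) rate, which is the number being compared above.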

submitted by /u/mossy_troll_84