IK_LLAMA now supports Qwen3.5 MTP :O

Compile, compile, compile!

https://github.com/ikawrakow/ik_llama.cpp/pull/1698

Will be testing shortly!

EDIT: You will need a GGUF with the MTP layers preserved. The PR author made some Qwen3.6 27B GGUFs at Q8_0 here - https://huggingface.co/Radamanthys11/Qwen3.6-27B-MTP-Q8_0-GGUF
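If you're not sure whether a given GGUF kept the MTP layers, one quick sanity check is to list its tensor names with the gguf-py package that ships with llama.cpp. Rough sketch below; the assumption that MTP tensors carry "mtp" in their names is mine, so adjust the pattern to whatever the converter actually emits:

```python
# Sketch: scan a GGUF for MTP-related tensors.
# Assumes the gguf-py package (pip install gguf) and that the converter
# names MTP tensors with "mtp" somewhere in them -- adjust if needed.
from gguf import GGUFReader

reader = GGUFReader("Qwen3.6-27B-MTP-Q8_0.gguf")  # path to your model
mtp = [t.name for t in reader.tensors if "mtp" in t.name.lower()]

if mtp:
    print(f"{len(mtp)} MTP tensors found, e.g. {mtp[:3]}")
else:
    print("No MTP tensors -- this GGUF was likely converted without them.")
```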

EDIT 2: IT WORKS! Noticeable speedup (an extra ~10 t/s) with pipeline parallelism and MTP at --draft-max 1. I went from 18-20 t/s to 30 t/s.
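For anyone wondering what that implies about the draft quality: with a 1-token draft, a simple speculative-decoding model says you emit about 1 + α tokens per verify step, where α is the acceptance rate. Back-of-envelope sketch (my own simplification, assuming the MTP draft is nearly free and a verify step costs about the same as a plain decode step):

```python
# Back-of-envelope: with --draft-max 1, each step emits the verified token
# plus the drafted token when accepted, i.e. ~(1 + alpha) tokens per step.
# Assumes draft cost ~0 and verify cost == plain decode cost (simplification).
baseline_tps = 19.0   # midpoint of the 18-20 t/s seen without MTP
mtp_tps = 30.0        # with -mtp --draft-max 1

speedup = mtp_tps / baseline_tps
implied_alpha = speedup - 1.0  # from speedup ~ 1 + alpha

print(f"speedup ~{speedup:.2f}x, implied acceptance rate ~{implied_alpha:.0%}")
# -> speedup ~1.58x, implied acceptance rate ~58%
```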

Big shoutout to the PR author, https://github.com/SamuelOliveirads

My full launch command, for reference:

/home/user/llm/ik_llama.cpp/build/bin/llama-server -m /home/user/llm/models/Qwen3.6-27B/MTP/Qwen3.6-27B-MTP-Q8_0.gguf --port 8080 --host 0.0.0.0 --no-mmap --threads 8 --jinja --cache-ram 65536 --chat-template-kwargs '{"preserve_thinking":true}' --cache-type-k bf16 --cache-type-v bf16 --flash-attn on --merge-qkv --ctx-size 100000 -ngl 99 -np 1 -sm layer -ts 50,50 -dev CUDA0,CUDA1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 -mtp --draft-max 1 --draft-p-min 0.0
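Once it's up, the server exposes the usual llama-server OpenAI-compatible endpoint, so you can smoke-test throughput with something like the sketch below (port 8080 matches the flags above; the prompt and token count are just placeholders):

```python
# Minimal throughput smoke test against the OpenAI-compatible endpoint
# exposed by llama-server (port 8080 matches --port above).
import time
import requests

start = time.time()
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Explain MTP decoding in one paragraph."}],
        "max_tokens": 256,
    },
    timeout=300,
)
elapsed = time.time() - start

data = resp.json()
print(data["choices"][0]["message"]["content"])
tokens = data["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> ~{tokens / elapsed:.1f} t/s end-to-end")
```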