With the MTP llama.cpp implementation in the Qwen3.6/3.5 models more VRAM is required for the MTP layer. However, many people don't realize this layer comes with its own KV cache which can also be quantized:
-cache-type-k-draft q8_0 -cache-type-v-draft q8_0 So is it free lunch thus allowing us to fit slightly more context?
From a short benchmark on Qwen3.7-27B-Q8_0 it certainly seems so:
--spec-type draft-mtp --spec-draft-n-max 3
Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 1302, "total_draft_accepted": 957, "aggregate_accept_rate": 0.735, "wall_s_total": 49.46 } --spec-type draft-mtp --spec-draft-n-max 3 -cache-type-k-draft q8_0 -cache-type-v-draft q8_0
Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 1302, "total_draft_accepted": 957, "aggregate_accept_rate": 0.735, "wall_s_total": 49.32 } Also tested with tensor parallelism:
-sm tenor --spec-type draft-mtp --spec-draft-n-max 3
Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 1294, "total_draft_accepted": 959, "aggregate_accept_rate": 0.7411, "wall_s_total": 38.42 } -sm tensor --spec-type draft-mtp --spec-draft-n-max 3 -cache-type-k-draft q8_0 -cache-type-v-draft q8_0
Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 1294, "total_draft_accepted": 959, "aggregate_accept_rate": 0.7411, "wall_s_total": 38.29 } Let me know if I'm coping or if you have other experiences.
Tested on 2xMi50 32GBs @ PCIe 4.0 x 8
[link] [comments]