Quantizing MTP KV Cache = free lunch?

With llama.cpp's MTP implementation for the Qwen3.6/3.5 models, the MTP layer needs extra VRAM. However, many people don't realize that this layer comes with its own KV cache, which can also be quantized:

--cache-type-k-draft q8_0 --cache-type-v-draft q8_0
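For a rough sense of how much VRAM is at stake, here is a minimal sketch of the arithmetic. It assumes the draft cache defaults to f16 and uses ggml's q8_0 block layout (32 int8 values plus one fp16 scale, so 34 bytes per 32 elements). The context length, KV head count, and head dimension are hypothetical placeholders, not the real Qwen values, so plug in your own.

```python
# Rough KV-cache size estimate for a single MTP (draft) layer.
# All model dimensions used in the example are hypothetical placeholders.

# ggml storage layouts: (bytes per block, elements per block)
BLOCK_LAYOUT = {
    "f16": (2, 1),     # 2 bytes per element
    "q8_0": (34, 32),  # 32 int8 values + one fp16 scale
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Bytes needed for K plus V across n_layers at a context of n_ctx tokens."""
    block_bytes, block_elems = BLOCK_LAYOUT[cache_type]
    elems_per_token = n_kv_heads * head_dim  # per layer, for K or V alone
    bytes_per_token = elems_per_token * block_bytes / block_elems
    return 2 * n_layers * n_ctx * bytes_per_token  # factor 2 covers K and V


if __name__ == "__main__":
    # Hypothetical dimensions: one MTP layer, 8 KV heads of size 128, 128k context.
    dims = dict(n_ctx=131072, n_layers=1, n_kv_heads=8, head_dim=128)
    f16 = kv_cache_bytes(cache_type="f16", **dims)
    q8 = kv_cache_bytes(cache_type="q8_0", **dims)
    print(f"f16  draft cache: {f16 / 2**20:.1f} MiB")
    print(f"q8_0 draft cache: {q8 / 2**20:.1f} MiB")
    print(f"saved: {(f16 - q8) / 2**20:.1f} MiB ({1 - q8 / f16:.1%})")
```

Whatever the real dimensions are, q8_0 takes 34/64 of the f16 size, so the draft cache shrinks by a bit under half.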

So is it a free lunch that lets us fit slightly more context?

From a short benchmark on Qwen3.7-27B-Q8_0 it certainly seems so:

--spec-type draft-mtp --spec-draft-n-max 3

Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 1302, "total_draft_accepted": 957, "aggregate_accept_rate": 0.735, "wall_s_total": 49.46 }

--spec-type draft-mtp --spec-draft-n-max 3 --cache-type-k-draft q8_0 --cache-type-v-draft q8_0

Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 1302, "total_draft_accepted": 957, "aggregate_accept_rate": 0.735, "wall_s_total": 49.32 }

Also tested with tensor parallelism:

-sm tensor --spec-type draft-mtp --spec-draft-n-max 3

Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 1294, "total_draft_accepted": 959, "aggregate_accept_rate": 0.7411, "wall_s_total": 38.42 }

-sm tensor --spec-type draft-mtp --spec-draft-n-max 3 --cache-type-k-draft q8_0 --cache-type-v-draft q8_0

Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 1294, "total_draft_accepted": 959, "aggregate_accept_rate": 0.7411, "wall_s_total": 38.29 }
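To sanity-check the aggregates above, here is a small sketch that recomputes the accept rate and the relative wall-time change from the posted numbers. The dictionary keys and function names are just labels chosen for this example. Note that within each pair the draft and accepted counts are identical, so only wall time differs.

```python
# Recompute accept rate and wall-time change from the aggregates in the post.
# Keys like "default -sm" are labels for this sketch, not llama.cpp output.

runs = {
    "default -sm": {
        "total_draft": 1302,
        "total_draft_accepted": 957,
        "wall_baseline_s": 49.46,  # default draft cache
        "wall_q8_0_s": 49.32,      # q8_0 draft cache
    },
    "-sm tensor": {
        "total_draft": 1294,
        "total_draft_accepted": 959,
        "wall_baseline_s": 38.42,
        "wall_q8_0_s": 38.29,
    },
}


def accept_rate(run):
    """Fraction of drafted tokens that were accepted."""
    return run["total_draft_accepted"] / run["total_draft"]


def wall_delta_pct(run):
    """Relative wall-time change of the q8_0 run vs. the baseline, in percent."""
    base = run["wall_baseline_s"]
    return (run["wall_q8_0_s"] - base) / base * 100


if __name__ == "__main__":
    for name, run in runs.items():
        print(f"{name}: accept rate {accept_rate(run):.4f}, "
              f"wall time {wall_delta_pct(run):+.2f}%")
```

Both wall-time deltas come out well under 1%, which with only 9 requests is plausibly run-to-run noise rather than a real speedup. The more meaningful result is that the accept counts did not move.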

Let me know if I'm coping or if you have other experiences.

Tested on 2x MI50 32GB @ PCIe 4.0 x8

submitted by /u/legit_split_