Quantizing MTP KV Cache = free lunch?

With the llama.cpp MTP implementation for the Qwen3.6/3.5 models, extra VRAM is required for the MTP layer. However, many people don't realize that this layer comes with its own KV cache, which can also be quantized:

--cache-type-k-draft q8_0 --cache-type-v-draft q8_0
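For a rough sense of how much that can save, here is a minimal sketch of the arithmetic. The q8_0 format stores each block of 32 values as 32 int8 values plus one fp16 scale (34 bytes), so about 8.5 bits per value versus 16 for f16. The layer shape and context length below are hypothetical placeholders, not the real Qwen MTP layer config, so plug in the numbers from your own model's metadata.

```python
F16_BITS = 16.0
Q8_0_BITS = 8.5  # 34 bytes per 32-value block: 32 int8 values + one fp16 scale


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Approximate KV cache size: K and V each store
    n_ctx * n_kv_heads * head_dim values per layer."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return values * bits_per_value / 8


# Hypothetical shape for a single draft layer (assumption, not Qwen's real config).
shape = dict(n_layers=1, n_kv_heads=4, head_dim=256, n_ctx=65536)

f16_bytes = kv_cache_bytes(**shape, bits_per_value=F16_BITS)
q8_bytes = kv_cache_bytes(**shape, bits_per_value=Q8_0_BITS)
saved_fraction = 1 - q8_bytes / f16_bytes

print(f"f16 draft cache:  {f16_bytes / 2**20:.0f} MiB")
print(f"q8_0 draft cache: {q8_bytes / 2**20:.0f} MiB")
print(f"saved: {saved_fraction:.1%}")
```

Whatever the real shape is, the relative saving from f16 to q8_0 stays the same, a bit under half of the draft cache.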

So is it a free lunch that lets us fit slightly more context?

From a short benchmark on Qwen3.7-27B-Q8_0 it certainly seems so:

--spec-type draft-mtp --spec-draft-n-max 3

Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 1302, "total_draft_accepted": 957, "aggregate_accept_rate": 0.735, "wall_s_total": 49.46 } 

--spec-type draft-mtp --spec-draft-n-max 3 --cache-type-k-draft q8_0 --cache-type-v-draft q8_0

Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 1302, "total_draft_accepted": 957, "aggregate_accept_rate": 0.735, "wall_s_total": 49.32 } 

Also tested with tensor parallelism:

-sm tensor --spec-type draft-mtp --spec-draft-n-max 3

Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 1294, "total_draft_accepted": 959, "aggregate_accept_rate": 0.7411, "wall_s_total": 38.42 } 

-sm tensor --spec-type draft-mtp --spec-draft-n-max 3 --cache-type-k-draft q8_0 --cache-type-v-draft q8_0

Aggregate: { "n_requests": 9, "total_predicted": 1404, "total_draft": 1294, "total_draft_accepted": 959, "aggregate_accept_rate": 0.7411, "wall_s_total": 38.29 } 
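To make the comparison explicit, here is a small sketch that recomputes the accept rate and the relative wall-time change from the four aggregates above. The numbers are copied from the runs; the labels ("default split", "default", "q8_0") and helper names are just mine for illustration. With one run per configuration, differences this small may well be within noise.

```python
# Aggregates copied from the four runs above.
runs = {
    "default split": {
        "default": {"draft": 1302, "accepted": 957, "wall_s": 49.46},
        "q8_0": {"draft": 1302, "accepted": 957, "wall_s": 49.32},
    },
    "-sm tensor": {
        "default": {"draft": 1294, "accepted": 959, "wall_s": 38.42},
        "q8_0": {"draft": 1294, "accepted": 959, "wall_s": 38.29},
    },
}


def accept_rate(run):
    """Fraction of drafted tokens that the main model accepted."""
    return run["accepted"] / run["draft"]


def wall_delta_pct(base, quant):
    """Relative wall-time change of the quantized run vs. the baseline, in percent."""
    return (quant["wall_s"] - base["wall_s"]) / base["wall_s"] * 100


for name, pair in runs.items():
    base, quant = pair["default"], pair["q8_0"]
    print(
        f"{name}: accept {accept_rate(base):.4f} -> {accept_rate(quant):.4f}, "
        f"wall time {wall_delta_pct(base, quant):+.2f}%"
    )
```

In both split modes the draft and accepted counts are identical with and without the quantized draft cache, and the wall time moves by well under one percent.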

Let me know if I'm coping or if you have other experiences.

Tested on 2x MI50 32GB @ PCIe 4.0 x8

submitted by /u/legit_split_