Hello everyone, I am banging my head trying to properly configure Qwen 3.6 27B MTP in vLLM.
I am using vLLM v0.20.0 in Docker, unquantized model with TP4 (4x 3090s), max context length.
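For reference, this is roughly the kind of launch I am using. The exact speculative-config keys (in particular `"method": "mtp"`) are an assumption on my part and may differ across vLLM versions, so treat this as a sketch rather than a known-good command:

```shell
# Sketch of the Docker launch, assuming the vllm/vllm-openai image and
# JSON --speculative-config syntax; verify flags against your vLLM version.
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model <model-path> \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```

Dropping the `--speculative-config` line is how I disable MTP for the comparison numbers below.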
At low context sizes, MTP with a value of 3 gives the best results: 48-50 tps generation speed. However, once the context grows larger (>70-80k), the tps drops to 15-20.
Without MTP I start at 30 tps, degrading to 26-27 tps at large context.
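My working hypothesis is that the draft acceptance rate falls off at long context, at which point the verification overhead eats the speedup. A quick back-of-envelope using the standard expected-tokens-per-step formula for speculative decoding (the acceptance rates below are made-up illustrative values, not measurements):

```python
def expected_tokens_per_step(a: float, k: int) -> float:
    """Expected tokens accepted per verification step with k draft tokens
    and per-token acceptance rate a (speculative decoding analysis,
    Leviathan et al. 2023): (1 - a^(k+1)) / (1 - a)."""
    if a >= 1.0:
        return k + 1.0  # every draft token accepted, plus the bonus token
    return (1.0 - a ** (k + 1)) / (1.0 - a)

# Hypothetical acceptance rates: high at short context, low at long context.
for a in (0.9, 0.5):
    print(f"a={a}: {expected_tokens_per_step(a, k=3):.2f} tokens/step")
```

With k=3, dropping the acceptance rate from 0.9 to 0.5 cuts the expected yield from ~3.4 to ~1.9 tokens per step, and since each step still pays for drafting plus a 4-token verification pass, it is easy to end up slower than plain decoding, which would match the numbers I am seeing.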
For now I have disabled it, since I am testing agentic coding, and even when I try to keep the context below 50% of the window (120-130k), I still go over 70k pretty often.
Any advice would be welcome.