Qwen3.6-27B-INT4 clocking 100 tps with a 256k context window on 1x RTX 5090 via vLLM 0.19

Thanks to the community, Qwen3.6-27B keeps getting faster. The following improves on yesterday's recipe and delivers a whopping 100+ tps (TG).

Model: https://huggingface.co/Lorbus/Qwen3.6-27B-int4-AutoRound

- MTP supported

- KLD is decent (much better than NVFP4, per the linked post), with the added benefit of being the smallest of the quants

- The smaller model size allows for full native 256k context window
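KLD here is the KL divergence between the full-precision model's and the quantized model's next-token distributions, averaged over positions (lower is better). A minimal sketch of the per-position metric, using made-up logits for a tiny three-token vocabulary:

```python
import math

def softmax(logits):
    # Numerically stable softmax over raw logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, quant_logits):
    # KL(P_ref || Q_quant) in nats for one token position
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits diverge by exactly zero
print(kl_divergence([2.0, 1.0, 0.5], [2.0, 1.0, 0.5]))  # 0.0

# A small quantization-style perturbation gives a small positive KLD
print(kl_divergence([2.0, 1.0, 0.5], [1.9, 1.05, 0.5]))
```

In practice the quant-comparison posts run both models over a shared eval text and average this quantity across all positions and the full vocabulary.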

Tokens per second (TG): 105-108 tps

Special credit to this post, which helped me discover the Lorbus quant: https://www.reddit.com/r/Olares/comments/1svg2ad/qwen3627b_at_85100_ts_on_a_24gb_rtx_5090_laptop/

Note that I didn't mess with TQ in my setup, since I can already hit the model's full native context length without it.
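Back-of-the-envelope math on why the full 256k window fits: with an fp8 KV cache, each token costs roughly 2 * num_layers * num_kv_heads * head_dim bytes. The dimensions below are hypothetical placeholders (check the model card for the real config), so treat this as a sizing sketch, not the actual budget:

```python
# Hypothetical model dimensions -- substitute the real config values.
NUM_LAYERS = 48
NUM_KV_HEADS = 4      # GQA keeps the KV head count small
HEAD_DIM = 128
KV_DTYPE_BYTES = 1    # fp8_e4m3, matching --kv-cache-dtype
CONTEXT_LEN = 262144  # matching --max-model-len

# K and V each store num_layers * num_kv_heads * head_dim values per token
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_DTYPE_BYTES
kv_cache_gib = kv_bytes_per_token * CONTEXT_LEN / 2**30

# INT4 weights: roughly 0.5 bytes per parameter, ignoring packing overhead
weights_gib = 27e9 * 0.5 / 2**30

budget_gib = 32 * 0.93  # approx usable budget at --gpu-memory-utilization 0.93
print(f"KV cache: {kv_cache_gib:.1f} GiB, weights: ~{weights_gib:.1f} GiB, "
      f"budget: ~{budget_gib:.1f} GiB")
```

Under these assumed dimensions the cache plus weights land comfortably under the 0.93-utilization budget, which is consistent with the full window fitting; an fp16 KV cache would quadruple the cache term and blow it.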

vLLM launch config:

```shell
args=(
    vllm serve "/root/autodl-tmp/llm-models"
    --max-model-len "262144"
    --gpu-memory-utilization "0.93"
    --attention-backend flashinfer
    --performance-mode interactivity
    --language-model-only
    --kv-cache-dtype "fp8_e4m3"
    --max-num-seqs "2"
    --skip-mm-profiling
    --quantization auto_round
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --enable-prefix-caching
    --enable-chunked-prefill
    --tool-call-parser qwen3_coder
    --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
    --host "0.0.0.0"
    --port "6006"
)
```
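On the speculative config: with MTP drafting num_speculative_tokens=3, the standard i.i.d.-acceptance model for speculative decoding predicts how many tokens each target-model verification step emits. The acceptance rate below is a made-up illustration, not a measured number for this setup:

```python
def expected_tokens_per_step(num_spec_tokens: int, accept_rate: float) -> float:
    # Standard speculative-decoding estimate under i.i.d. acceptance:
    # each verification step emits between 1 and k+1 tokens, and the
    # expectation is the geometric partial sum (1 - a^(k+1)) / (1 - a).
    k, a = num_spec_tokens, accept_rate
    return (1 - a ** (k + 1)) / (1 - a)

# With k=3 draft tokens and a hypothetical 70% acceptance rate,
# each target-model step yields ~2.5 tokens on average.
print(expected_tokens_per_step(3, 0.7))

# With nothing ever accepted, it degrades to plain decoding: 1 token/step.
print(expected_tokens_per_step(3, 0.0))
```

This is why MTP is doing a lot of the lifting in the TG number: a 2-2.5x multiplier over the base decode rate is in the right ballpark for the jump to 100+ tps.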

submitted by /u/Kindly-Cantaloupe978
