Qwen3.6 27B on dual RTX 5060 Ti 16GB with vLLM: ~60 tok/s, 204k context working

I’ve been testing Qwen3.6 27B on a pretty non-standard local setup and figured the numbers might be useful for anyone looking at the newer 16GB Blackwell cards.

Hardware:

  • 2x RTX 5060 Ti 16GB
  • 32GB total VRAM
  • Proxmox LXC
  • 16 vCPU
  • ~60GB RAM
  • CUDA 13 / Torch 2.11 nightly
  • vLLM nightly: 0.19.2rc1.dev
  • Model: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP

vLLM launch shape:

vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \
  --served-model-name qwen36-nvfp4-mtp \
  --tensor-parallel-size 2 \
  --max-model-len 204800 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --quantization modelopt \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --reasoning-parser qwen3 \
  --language-model-only \
  --generation-config vllm \
  --disable-custom-all-reduce \
  --attention-backend TRITON_ATTN
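Once the server is up, a quick sanity check is a single request against the OpenAI-compatible chat endpoint. Rough sketch, stdlib only; assumes the default port 8000 and the served model name from the command above (the helper names are mine, not from any particular script):

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    # Minimal OpenAI-style chat payload for the vLLM server above.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send_chat(base_url: str, payload: dict, timeout: float = 120.0) -> str:
    """POST to the chat completions endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

payload = build_chat_payload("qwen36-nvfp4-mtp", "Say hello in one sentence.")
# reply = send_chat("http://localhost:8000", payload)  # uncomment with the server running
```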

Performance so far:

  • 8K context, MTP n=1: ~50–52 tok/s
  • 8K context, MTP n=3: ~62–66 tok/s
  • 32K context: ~59–66 tok/s
  • 204800 context starts and works, but is tight
  • Idle VRAM at 204k: ~14.45GiB per GPU
  • After a 168k-token prefill: ~15.65GiB per GPU
  • 168k-token needle/retrieval smoke test passed in ~256s
  • Near-limit test correctly rejected prompt+output over the 204800 window
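The tok/s figures above are just completion tokens divided by wall-clock generation time; a minimal sketch of that arithmetic (hypothetical helpers, not the exact script I used):

```python
import time

def timed(fn):
    """Run fn() and return (result, elapsed seconds of wall-clock time)."""
    t0 = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - t0

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    # Decode throughput = generated tokens / wall-clock generation time.
    return completion_tokens / elapsed_s

# e.g. 620 tokens generated in 10 s of wall time -> 62.0 tok/s,
# in the same range as the MTP n=3 numbers above
rate = tokens_per_second(620, 10.0)
```

In practice you'd take the completion token count from the `usage` field of the server's response rather than counting yourself.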

Thinking mode works too, but you need to give it enough output budget. With a low max_tokens, Qwen can spend the entire cap on reasoning tokens and return no final content. Around 1024+ is fine for small prompts; 4096–8192 is safer for real reasoning tasks.
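One way to guard against empty answers is to clamp max_tokens to a floor before sending reasoning prompts. Hypothetical helper; the 1024/4096 floors are just the rough values mentioned above:

```python
def reasoning_max_tokens(requested: int, heavy_task: bool = False) -> int:
    """Clamp max_tokens so thinking mode has room left for final content.

    With too small a cap, the model can burn the whole budget on the
    reasoning channel and return an empty answer.
    """
    floor = 4096 if heavy_task else 1024
    return max(requested, floor)
```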

Caveats:

  • 204k context is right on the edge with 2x16GB.
  • gpu_memory_utilization=0.94 failed KV allocation; 0.95 worked.
  • Startup takes several minutes due to compile/autotune.
  • Logs show FlashInfer autotuner OOM fallbacks during startup, but the server still becomes healthy.
  • I had better luck with TRITON_ATTN for the text path.
  • This is not a high-concurrency config: max_num_seqs=1.
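Given the multi-minute startup, I'd poll the server's /health endpoint rather than guess when it's ready. Sketch with an injected probe so the retry logic is testable; the endpoint path and default port are assumptions, check them against your vLLM version:

```python
import time
import urllib.error
import urllib.request

def wait_until_healthy(probe, attempts: int = 60, delay_s: float = 5.0,
                       sleep=time.sleep) -> bool:
    """Call probe() until it returns True or attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay_s)
    return False

def http_ok(url: str = "http://localhost:8000/health") -> bool:
    # Treat any HTTP 200 as healthy; connection errors mean still starting.
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# ready = wait_until_healthy(http_ok)  # run while the server is compiling/autotuning
```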

Overall: dual 5060 Ti 16GB seems surprisingly usable for Qwen3.6 27B if you use the right checkpoint/runtime combo. It’s not roomy, but it works.

submitted by /u/do_u_think_im_spooky