The angle here is native Windows, no WSL. Simple installation, open source, no telemetry. Not selling or promoting anything: https://github.com/devnen/qwen3.6-windows-server

Numbers (RTX 3090, Windows 10):

- 72 tok/s short prompt
- 64.5 tok/s long prompt (~25k tokens)
- 53.4 tok/s at 127k ctx (single GPU)
- 160k ctx on PP=2 (2×3090 GPUs)

Honestly, these aren't r/LocalLLaMA records. The community has hit 80–82 tok/s on a 3090 with TurboQuant 3-bit KV, and 160 tok/s on a 5090 on Linux. My launcher and patched vLLM close that gap on Windows.

Simple installation:

1. Download the portable launcher from the repo.
2. Run it.

I had to build a patched vLLM fork for Windows to fix a few issues and make this work. I'm including a portable launcher that ships the prebuilt wheel. First run installs the bundled vLLM wheel + deps into the embedded Python (~5–15 min, one-time), then offers to auto-download the Lorbus AutoRound INT4 quant from HuggingFace if you don't already have it (or grab it yourself ahead of time; see the first sketch at the end of this post). Subsequent launches skip straight to the TUI.

Tested on Windows 10 + 2× RTX 3090 with the Lorbus AutoRound INT4 quant. Should work on any Ampere/Ada/Blackwell card (3090/4090/5090/A6000). Won't work on Pascal, Turing, Arc, or AMD.

I have a similar launcher and patched vLLM for Linux with some very competitive numbers, but it's still a work in progress.

If you're on a 3090/4090/5090 on Windows, give it a spin and post your numbers (a quick way to measure them is sketched below).

Full details, patches, benchmarks, and config snapshots: https://github.com/devnen/qwen3.6-windows-server
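If you'd rather pre-fetch the quant instead of letting the launcher do it, here's a minimal sketch using `huggingface_hub`. The repo ID below is a placeholder I made up for illustration; check the project README for the actual HuggingFace path.

```python
# Minimal sketch: pre-download the AutoRound INT4 quant so the launcher
# finds it locally on first run instead of downloading it itself.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Lorbus/Qwen3.6-AutoRound-INT4",  # placeholder, NOT the confirmed repo ID
)
print(f"Model downloaded to: {local_dir}")
```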
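And if you want to sanity-check your numbers before posting them, here's a rough sketch against the OpenAI-compatible endpoint that vLLM serves. It lumps prefill in with decode, so it will read a bit lower than the launcher's own stats; the port and the blank API key are assumptions based on vLLM defaults, so adjust to match your setup.

```python
# Rough sketch: measure tok/s against a running vLLM OpenAI-compatible
# server (default localhost:8000). Timing includes prompt processing,
# so treat the result as a conservative estimate of decode speed.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
resp = client.chat.completions.create(
    model=client.models.list().data[0].id,  # whatever model the server is serving
    messages=[{"role": "user", "content": "Write a 500-word story about a lighthouse."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```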