The angle here is native Windows, no WSL. Simple installation, open source, no telemetry. Not selling or promoting anything: https://github.com/devnen/qwen3.6-windows-server

Numbers (RTX 3090, Windows 10):

- 72 tok/s short prompt
- 64.5 tok/s long prompt (~25k tokens)
- 53.4 tok/s at 127k ctx (single GPU)
- 160k ctx on PP=2 (2×3090 GPUs)

Honestly, these aren't r/LocalLLaMA records. The community has hit 80–82 tok/s on a 3090 with TurboQuant 3-bit KV, and 160 tok/s on a 5090 on Linux. My launcher and patched vLLM close that gap on Windows.

Simple installation:

1. Download the portable launcher from the repo.
2. Run it.

I had to build a patched vLLM fork for Windows to fix a few issues and make this work. I'm including a portable launcher that ships the prebuilt wheel. First run installs the bundled vLLM wheel + deps into the embedded Python (~5–15 min, one-time), then offers to auto-download the Lorbus AutoRound INT4 quant from HuggingFace if you don't already have it (or grab it yourself ahead of time; see the first sketch at the end of this post). Subsequent launches skip straight to the TUI.

Tested on Windows 10 + 2× RTX 3090 with the Lorbus AutoRound INT4 quant. Should work on any Ampere/Ada/Blackwell card (3090/4090/5090/A6000). Won't work on Pascal, Turing, Arc, or AMD.

I have a similar launcher and patched vLLM for Linux with some very competitive numbers, but it's still a work in progress.

If you're on a 3090/4090/5090 on Windows, give it a spin and post your numbers (a quick way to measure them is sketched below).

Full details, patches, benchmarks, and config snapshots: https://github.com/devnen/qwen3.6-windows-server
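If you'd rather pre-fetch the quant instead of letting the launcher do it, here's a minimal sketch using `huggingface_hub`. The repo ID below is a placeholder I made up for illustration; check the project README for the actual HuggingFace path.

```python
# Minimal sketch: pre-download the AutoRound INT4 quant so the launcher
# finds it locally on first run instead of downloading it itself.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Lorbus/Qwen3.6-AutoRound-INT4",  # placeholder, NOT the confirmed repo ID
)
print(f"Model downloaded to: {local_dir}")
```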
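And if you want to sanity-check your numbers before posting them, here's a rough sketch against the OpenAI-compatible endpoint that vLLM serves. It lumps prefill in with decode, so it will read a bit lower than the launcher's own stats; the port and the blank API key are assumptions based on vLLM defaults, so adjust to match your setup.

```python
# Rough sketch: measure tok/s against a running vLLM OpenAI-compatible
# server (default localhost:8000). Timing includes prompt processing,
# so treat the result as a conservative estimate of decode speed.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
resp = client.chat.completions.create(
    model=client.models.list().data[0].id,  # whatever model the server is serving
    messages=[{"role": "user", "content": "Write a 500-word story about a lighthouse."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```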