Quant Qwen3.6-27B on 16GB VRAM with 100k context length

https://preview.redd.it/tblmrwxkbexg1.png?width=1193&format=png&auto=webp&s=6dea1e6684e75e22852d57c0c72e9171deb56ae2

I have experimented how to run Qwen3.6-27B on my laptop with an A5000 16GB GPU. I have created an own IQ4_XS GGUF "qwen3.6-27b-IQ4_XS-pure.gguf" with the Unsloth imatrix and compared the mean KLD of it with other quants.

You can see that I also have tested different turboquant versions. It looks that the buun-llama-cpp fork is better than the TheTom/llama-cpp-turboquant fork.

If you want to try my version, you can do the following:

Download my GGUF from Huggingface. It already contains an improved chat template base on this one
Clone buun-llama-cpp from https://github.com/spiritbuun/buun-llama-cpp
Build it, I have used on Windows:cmake -B build -G Ninja -DGGML_CUDA=ON -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl cmake --build build --config Release -j 16
Check e.g. with nvidia-smi that the GPU VRAM is all free
Run it like, I have used this command:build/bin/llama-server --model qwen3.6-27b-IQ4_XS-pure.gguf --alias qwen3.6-27b -np 1 -ctk turbo3_tcq -ctv turbo3_tcq -c 100000 --fit off -ngl 999 --no-mmap -fa on --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
To use it on OpenCode, I use this ~/.config/opencode/opencode.json file:

{ "$schema": "https://opencode.ai/config.json", "plugin": [ "opencode-anthropic-auth@latest", "opencode-copilot-auth@latest" ], "share": "disabled", "provider": { "llama.cpp": { "npm": "@ai-sdk/openai-compatible", "name": "llama.cpp (OpenAI Compatible)", "options": { "baseURL": "http://127.0.0.1:8080/v1", "apiKey": "1234" }, "models": { "qwen3.5-27b": { "name": "Qwen 3.5 27B", "interleaved": { "field": "reasoning_content" }, "limit": { "context": 100000, "output": 32000 }, "temperature": true, "reasoning": true, "attachment": false, "tool_call": true, "modalities": { "input": [ "text" ], "output": [ "text" ] }, "cost": { "input": 0, "output": 0, "cache_read": 0, "cache_write": 0 } } } } }, "agent": { "code-reviewer": { "description": "Reviews code for best practices and potential issues", "model": "llama.cpp/qwen3.5-27b", "prompt": "You are a code reviewer. Focus on security, understandability, conciseness, maintainability and performance." }, "plan": { "model": "llama.cpp/qwen3.5-27b" } }, "model": "llama.cpp/qwen3.5-27b", "small_model": "llama.cpp/qwen3.5-27b" }{ "$schema": "https://opencode.ai/config.json", "plugin": [ "opencode-anthropic-auth@latest", "opencode-copilot-auth@latest" ], "share": "disabled", "provider": { "llama.cpp": { "npm": "@ai-sdk/openai-compatible", "name": "llama.cpp (OpenAI Compatible)", "options": { "baseURL": "http://127.0.0.1:8080/v1", "apiKey": "1234" }, "models": { "qwen3.5-27b": { "name": "Qwen 3.5 27B", "interleaved": { "field": "reasoning_content" }, "limit": { "context": 100000, "output": 32000 }, "temperature": true, "reasoning": true, "attachment": false, "tool_call": true, "modalities": { "input": [ "text" ], "output": [ "text" ] }, "cost": { "input": 0, "output": 0, "cache_read": 0, "cache_write": 0 } } } } }, "agent": { "code-reviewer": { "description": "Reviews code for best practices and potential issues", "model": "llama.cpp/qwen3.5-27b", "prompt": "You are a code reviewer. Focus on security, understandability, conciseness, maintainability and performance." }, "plan": { "model": "llama.cpp/qwen3.5-27b" } }, "model": "llama.cpp/qwen3.5-27b", "small_model": "llama.cpp/qwen3.5-27b" }

I get around 21 tokens/s generation speed/ 550 tokens/s prompt processing in the beginning, later it goes down to around 14 tokens/s (485 tokens/s prompt processing) at 15k context.

submitted by /u/Due-Project-7507
[link] [comments]

Leave a Comment