Running Qwen3.6 35B A3B on 8GB VRAM and 32GB RAM with ~190k context

If anyone is looking for a good high-speed setup with ~190k context, this config has been working insanely well for me.

I’m using my laptop as a server over Tailscale. I installed Linux on it, and I’m running:

- Qwen3.6 35B A3B

- RTX 4060 8GB VRAM

- 32GB DDR5 5600MHz RAM

- Q5 quant models

Current models tested:

- `mudler/Qwen3.6-35B-A3B-APEX-GGUF`
  - ~40 tok/sec → 37 tok/sec

- `hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF`
  - ~43 tok/sec → 37 tok/sec

I can push it up to ~51 tok/sec by tweaking:

- `--ctx-size 192640`

- `--n-gpu-layers 430`

- `--n-cpu-moe 35`

and adjusting those values slightly higher/lower depending on stability and memory usage.
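
If you want to find the sweet spot more systematically, a rough sweep like the sketch below is one way to do it. It assumes `llama-bench` was built alongside `llama-server` in the same `build/bin` directory, and that your build's `llama-bench` accepts `--n-cpu-moe` (only newer builds do, so check `llama-bench --help` first):

```bash
#!/bin/bash
# Rough sweep of --n-cpu-moe values to see where tok/s peaks.
# Assumes llama-bench sits next to llama-server in build/bin; the
# --n-cpu-moe flag is only in newer llama.cpp builds, so verify it
# exists with `llama-bench --help` before running.
BENCH=/home/atulloq/llama-cpp-turboquant/build/bin/llama-bench
MODEL=/home/atulloq/.lmstudio/models/mudler/Qwen3.6-35B-A3B-APEX-GGUF/Qwen3.6-35B-A3B-APEX-I-Balanced.gguf

for ncmoe in 30 32 35 37 40; do
  echo "=== --n-cpu-moe $ncmoe ==="
  # -ngl 99 offloads everything the MoE-offload setting leaves on the GPU side
  "$BENCH" -m "$MODEL" -p 512 -n 128 -ngl 99 --n-cpu-moe "$ncmoe"
done
```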

Here’s my current config:

```bash
#!/bin/bash

# --- LLAMA SERVER LAUNCHER SCRIPT ---

#SELECTED_MODEL="/home/atulloq/.lmstudio/models/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q5_K_M.gguf"
SELECTED_MODEL="/home/atulloq/.lmstudio/models/mudler/Qwen3.6-35B-A3B-APEX-GGUF/Qwen3.6-35B-A3B-APEX-I-Balanced.gguf"

echo "Starting Llama Server..."
echo "Model: $SELECTED_MODEL"

/home/atulloq/llama-cpp-turboquant/build/bin/llama-server \
  --model "$SELECTED_MODEL" \
  --host 0.0.0.0 \
  --port 8085 \
  --ctx-size 192640 \
  --n-gpu-layers 430 \
  --n-cpu-moe 35 \
  --cache-type-k "turbo4" \
  --cache-type-v "turbo4" \
  --flash-attn on \
  --batch-size 2048 \
  --parallel 1 \
  --no-mmap \
  --mlock \
  --ubatch-size 512 \
  --threads 6 \
  --cont-batching \
  --timeout 300 \
  --temp 0.2 \
  --top-p 0.95 \
  --min-p 0.05 \
  --top-k 20 \
  --metrics \
  --chat-template-kwargs '{"preserve_thinking": true}'
```

I’m using this fork of llama.cpp with TurboQuant support:

https://github.com/TheTom/turboquant_plus#build-llamacpp-with-turboquant
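
To sanity-check the server from another machine on the tailnet, a plain OpenAI-style request works, since `llama-server` exposes an OpenAI-compatible API. The hostname below is a placeholder for your machine's Tailscale name or 100.x address:

```bash
# Replace "my-laptop" with the server's Tailscale hostname or 100.x.y.z IP.
curl http://my-laptop:8085/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
        "temperature": 0.2
      }'
```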

A few honest notes:

- Q4 is noticeably worse for long-context reasoning compared to Q5 on these models.

- `--no-mmap` + `--mlock` helped reduce weird slowdowns for me.

- TurboQuant KV cache makes a massive difference at high context sizes (see the rough size estimate after this list).

- Linux performs way better than Windows for this setup.

- Don’t expect these speeds if your RAM bandwidth is bad. DDR5 matters here.
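
A ballpark sketch of why the KV cache type matters so much at 192k context: cache size scales linearly with context length, layer count, and bytes per element. The layer/head numbers below are placeholder assumptions, not the real architecture of this model; the f16-vs-4-bit ratio is the point:

```bash
# Back-of-the-envelope KV cache size. n_layers/n_kv_heads/head_dim are
# placeholder values, NOT the real numbers for this model.
n_layers=48; n_kv_heads=8; head_dim=128; n_ctx=192640

# K and V tensors, f16 = 2 bytes per element
f16_mib=$(( 2 * n_layers * n_kv_heads * head_dim * n_ctx * 2 / 1024 / 1024 ))
# K and V tensors, ~4-bit = 0.5 bytes per element
q4_mib=$((  2 * n_layers * n_kv_heads * head_dim * n_ctx / 2 / 1024 / 1024 ))

echo "f16 KV cache:   ~${f16_mib} MiB"
echo "4-bit KV cache: ~${q4_mib} MiB"
```

With those placeholder numbers the f16 cache lands in the tens of GiB while the 4-bit cache is several times smaller, which is why the cache type decides whether a 192k context fits in this much RAM/VRAM at all.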

If anyone has optimizations for:

- better long-context stability,

- higher token throughput,

- or smarter `n-cpu-moe` tuning,

I’d love to test them.
