Hey folks, looking for advice on whether to keep or delete a huge model file.
I’m testing local coding/agentic workflows on an RTX 5080 16GB + 96GB RAM. I already have Qwen3.6-35B-A3B-MTP running on the llama.cpp MTP branch, native on Windows, using CPU expert offload.
Current A3B setup:
```
Qwen3.6-35B-A3B-MTP Q8_0 GGUF
--fit on --fit-target 1536 --n-cpu-moe 34 -c 232144
--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0
--batch-size 2048 --ubatch-size 1024 --cache-ram -1
--checkpoint-every-n-tokens 8192 --spec-type mtp --spec-draft-n-max 2
```
At my previous ~196K context setting, with around 118K active prompt, I was seeing roughly 1178 tok/s prefill and 32 tok/s decode. Follow-ups around 118K–143K active prompt were usually ~32–37 tok/s when MTP acceptance was good. DraftN=3 worked but over-drafted too often at deep context, so DraftN=2 became my stable setting.
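For anyone wondering why DraftN=2 won out, here's the standard speculative-decoding expectation I used to sanity-check it. Everything numeric in this sketch (base tok/s, acceptance rates, per-draft overhead) is a made-up illustration, not a measurement from my rig:

```python
# Back-of-envelope for the DraftN=2 vs DraftN=3 trade-off. The formula is
# the usual speculative-decoding expectation; all numbers are hypothetical.

def expected_decode_tps(base_tps: float, draft_n: int, accept: float) -> float:
    # Expected tokens emitted per verify step: the first k drafted tokens
    # are all accepted with probability accept**k, plus the one token the
    # verify pass always produces.
    tokens_per_step = sum(accept**k for k in range(1, draft_n + 1)) + 1
    # Assume each extra drafted token adds a small per-step cost (assumption).
    step_cost = 1.0 + 0.05 * draft_n
    return base_tps * tokens_per_step / step_cost

for n in (1, 2, 3):
    # Acceptance tends to drop at deep context: compare a good vs a bad regime.
    good = expected_decode_tps(20.0, n, 0.80)
    bad = expected_decode_tps(20.0, n, 0.55)
    print(f"DraftN={n}: ~{good:.1f} tok/s (good accept), ~{bad:.1f} tok/s (bad accept)")
```

The point: the marginal gain from a third draft token shrinks fast once acceptance drops, while the per-step cost keeps growing, which matches what I saw at deep context.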
Now I’m testing 232K context with the same A3B setup.
I also downloaded the new Qwen3.6-27B dense MTP grafted GGUF / UD XL model, but it’s around 30GB and I only have ~4GB left on my C drive. Before deciding whether to delete one or shuffle things around to keep both, I’m trying to find out whether people with similar hardware have actually compared these.
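For context on the disk math, here's a rough Q8_0 size estimate. The 34-byte/32-weight block layout is real Q8_0, but the parameter counts are nominal and the UD XL quant mixes layer precisions, so actual files will differ somewhat:

```python
# Rough GGUF size lower bounds at Q8_0 (~1.06 bytes/weight: 34-byte blocks
# holding 32 weights plus a scale). GGUF metadata adds a bit on top.
BYTES_PER_WEIGHT_Q8_0 = 34 / 32

for name, params in [("Qwen3.6-27B dense", 27e9),
                     ("Qwen3.6-35B-A3B", 35e9)]:
    gb = params * BYTES_PER_WEIGHT_Q8_0 / 1e9
    print(f"{name}: ~{gb:.0f} GB on disk at Q8_0")
# -> roughly ~29 GB and ~37 GB, which is why both won't coexist on a
#    nearly full C drive.
```

The VRAM side is the more interesting constraint: a dense 27B at Q8_0 can't fully fit in 16GB, so layers spill to CPU and every decoded token pays for them, whereas the A3B only activates ~3B params per token, so CPU expert offload hurts it less per step.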
Question: on 16GB VRAM + lots of system RAM, would you keep testing Qwen3.6-27B dense MTP, or stick with Qwen3.6-35B-A3B MoE + CPU expert offload + MTP?
I’m especially interested in real experience at 100K+ active prompt, not just short-prompt tok/s.
Things I’m trying to understand:
- Does 27B dense MTP actually beat 35B-A3B MTP + CPU expert offload on 16GB VRAM?
- At deep context, does dense 27B feel smoother, or does A3B still win because active params are much lower?
- For sustained coding-agent use, is dense consistency better than MoE active-param efficiency?
- If you tested both, which one would you keep if disk space were tight?
I’m not trying to win a benchmark. I care about speed, context, and coding quality for long-running local agent work, tool use, etc.
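If anyone wants to compare apples to apples, this is roughly how I pull my prefill/decode numbers: llama-server's /completion endpoint returns a timings object in the response. The port, prompt file, and n_predict here are placeholders for my setup:

```python
# Minimal timing probe against a local llama-server instance (sketch).
import json
import urllib.request

payload = {
    # ~118K-token prompt saved from a real agent session (placeholder path)
    "prompt": open("long_context_prompt.txt").read(),
    "n_predict": 256,
    "cache_prompt": True,  # reuse the KV cache across follow-up probes
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",  # assumed llama-server port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    timings = json.load(resp)["timings"]

print(f"prefill: {timings['prompt_per_second']:.0f} tok/s "
      f"({timings['prompt_n']} tokens)")
print(f"decode:  {timings['predicted_per_second']:.1f} tok/s "
      f"({timings['predicted_n']} tokens)")
```

Running it twice with cache_prompt on also shows the follow-up-turn behaviour I described, since the second call skips most of the prefill.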