Hey guys, we ran Qwen3.6-35B-A3B GGUF KLD performance benchmarks to help you choose the best quant. Unsloth quants have the best KLD vs. disk-space trade-off in 21 of 22 cases on the Pareto frontier. GGUFs: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

We also want to clear up a few misunderstandings around our GGUF updates. Some people have said we re-upload often because of our own mistakes, or that issues like the CUDA 13.2 gibberish are just excuses. We understand the concern, but in reality we tend to publicize issues quickly and tell people to update. In roughly 95% of cases, the root causes were out of our hands - we just try to be transparent and keep the community informed. A few examples:

**Gemma 4 was re-uploaded 4 times.** Three re-uploads were due to roughly 10 to 20 llama.cpp bug fixes, some of which we helped investigate and contributed fixes for. The fourth was an official Gemma chat template improvement from Google. Every provider had to update, not just us. The llama.cpp PR history shows ~30 fixes/improvements for Gemma 4.

**MiniMax 2.7 NaNs.** We found NaNs in 38% of Bartowski's quants (10/26) and 22% of ours (5/23). We identified a fix and have already patched ours - see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/. Bartowski has not patched yet, but is actively working on it.
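For anyone curious what a NaN scan of quant tensors looks like in principle, here is a toy sketch. The function and tensor names are illustrative only (this is not Unsloth's actual pipeline), and it assumes tensors have already been dequantized to float numpy arrays:

```python
import numpy as np

def find_nan_tensors(tensors):
    # Return names of tensors containing NaN or Inf values.
    # `tensors`: dict of name -> float numpy array (e.g. tensors
    # dequantized from a GGUF; names below are illustrative).
    bad = []
    for name, arr in tensors.items():
        if not np.isfinite(arr).all():
            bad.append(name)
    return bad

# Toy example: one clean tensor, one poisoned with a NaN.
clean = np.ones((4, 4), dtype=np.float32)
poisoned = clean.copy()
poisoned[0, 0] = np.nan
print(find_nan_tensors({"blk.0.attn_q.weight": clean,
                        "blk.0.ssm_out.weight": poisoned}))
# -> ['blk.0.ssm_out.weight']
```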
**Qwen3.5 SSM issues.** We shared 7TB of research artifacts showing which layers should not be quantized. The issue was not that providers' quants were broken, but that they were not optimal - mainly around the `ssm_out` and other `ssm_*` tensors. We have since improved ours and now lead on KLD vs. disk space for Qwen3.5 as well. Most, if not all, quant providers then took our findings and updated their quants. We covered our analysis and research at https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwen3535ba3b_unsloth_dynamic_ggufs_benchmarks/ and https://www.reddit.com/r/LocalLLaMA/comments/1rlkptk/final_qwen35_unsloth_gguf_update/

**CUDA 13.2 is actually broken.** It causes some low-bit quants of all models to produce gibberish. Some people have dismissed this as a non-issue, but NVIDIA has confirmed it is a problem and a fix is coming in CUDA 13.3. See Unsloth issue 4849 and llama.cpp issues 21255 and 21371. As a temporary workaround, use CUDA 13.1 - see https://github.com/ggml-org/llama.cpp/issues/21255#issuecomment-4248403175 (quote from https://github.com/johnnynunez).
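For context on what the KLD numbers in these comparisons measure: token-level KL divergence between the full-precision model's output distribution and the quant's, averaged over positions (lower is better). A minimal numpy sketch with a toy vocabulary - helper names are ours, not from any benchmark harness:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(base_logits, quant_logits, eps=1e-10):
    # Mean token-level KL(P_base || P_quant) across positions.
    p = softmax(base_logits)
    q = softmax(quant_logits)
    per_token = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(per_token.mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 32))  # 4 positions, toy 32-token vocab
print(mean_kld(base, base))      # identical logits -> ~0
print(mean_kld(base, base + 0.5 * rng.normal(size=base.shape)))  # perturbed -> > 0
```

A real benchmark runs both models over a shared text corpus and compares their per-token logits in exactly this way, which is why KLD plotted against file size gives the Pareto frontier referenced above.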
Thanks again for all the support - we really appreciate it. Hope you all have a great Friday and weekend. More benchmarks and investigation details here: https://unsloth.ai/docs/models/qwen3.6#unsloth-gguf-benchmarks