Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation

Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python, using Neo AI Engineer to build and run the evals.

Benchmarks used:

  • HumanEval: code generation
  • HellaSwag: commonsense reasoning
  • BFCL: function calling

Total samples:

  • HumanEval: 164
  • HellaSwag: 100
  • BFCL: 400
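HumanEval scores by functional correctness: a sample counts as passed only if the generated function clears the task's unit tests, and pass@1 is simply passed/total. A minimal sketch of that check (this is not the official harness, which sandboxes execution, and the sample task is made up):

```python
# Minimal sketch of HumanEval-style pass@1 scoring: execute the model's
# completion, then run the task's unit tests against it. NOT the official
# harness (which sandboxes execution); the example task is hypothetical.

def check_sample(completion: str, test_code: str, entry_point: str) -> bool:
    """Return True if the completion passes the task's tests."""
    namespace: dict = {}
    try:
        exec(completion, namespace)   # define the candidate function
        exec(test_code, namespace)    # defines a check(fn) test helper
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False

# Hypothetical example task
completion = "def add(a, b):\n    return a + b\n"
tests = "def check(fn):\n    assert fn(2, 3) == 5\n    assert fn(-1, 1) == 0\n"

print(check_sample(completion, tests, "add"))  # True

# pass@1 over the benchmark is then passed / total, e.g. 92/164
print(f"{92 / 164:.2%}")  # 56.10%
```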

Results:

BF16

  • HumanEval: 56.10% (92/164)
  • HellaSwag: 90.00% (90/100)
  • BFCL: 63.25% (253/400)
  • Avg accuracy: 69.78%
  • Throughput: 15.5 tok/s
  • Peak RAM: 54 GB
  • Model size: 53.8 GB

Q4_K_M

  • HumanEval: 50.61% (83/164)
  • HellaSwag: 86.00% (86/100)
  • BFCL: 63.00% (252/400)
  • Avg accuracy: 66.54%
  • Throughput: 22.5 tok/s
  • Peak RAM: 28 GB
  • Model size: 16.8 GB

Q8_0

  • HumanEval: 52.44% (86/164)
  • HellaSwag: 83.00% (83/100)
  • BFCL: 63.00% (252/400)
  • Avg accuracy: 66.15%
  • Throughput: 18.0 tok/s
  • Peak RAM: 42 GB
  • Model size: 28.6 GB
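The "Avg accuracy" figures above are unweighted macro averages of the three benchmark percentages (not weighted by sample count, so BFCL's 400 samples count the same as HellaSwag's 100). A quick sketch to reproduce them:

```python
# Reproduce the "Avg accuracy" figures: unweighted mean of the three
# benchmark scores. A sample-weighted (micro) average would differ,
# since BFCL has 400 samples vs HellaSwag's 100.

scores = {
    "BF16":   [56.10, 90.00, 63.25],  # HumanEval, HellaSwag, BFCL
    "Q4_K_M": [50.61, 86.00, 63.00],
    "Q8_0":   [52.44, 83.00, 63.00],
}

for variant, s in scores.items():
    print(f"{variant}: {sum(s) / len(s):.2f}%")
# BF16: 69.78%
# Q4_K_M: 66.54%
# Q8_0: 66.15%
```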

What stood out:

Q4_K_M looks like the best practical variant here. It keeps BFCL almost identical to BF16, drops about 5.5 points on HumanEval, and is still only 4 points behind BF16 on HellaSwag.

The tradeoff is pretty good:

  • 1.45x faster than BF16
  • 48% less peak RAM
  • 68.8% smaller model file
  • nearly identical function calling score
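Those tradeoff numbers fall straight out of the raw measurements:

```python
# Derive the Q4_K_M-vs-BF16 tradeoff figures from the measurements above.
bf16 = {"tok_s": 15.5, "ram_gb": 54, "size_gb": 53.8}
q4   = {"tok_s": 22.5, "ram_gb": 28, "size_gb": 16.8}

speedup   = q4["tok_s"] / bf16["tok_s"]          # ~1.45x faster
ram_drop  = 1 - q4["ram_gb"] / bf16["ram_gb"]    # ~48% less peak RAM
size_drop = 1 - q4["size_gb"] / bf16["size_gb"]  # ~68.8% smaller file

print(f"{speedup:.2f}x faster, {ram_drop:.0%} less RAM, {size_drop:.1%} smaller")
# 1.45x faster, 48% less RAM, 68.8% smaller
```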

Q8_0 was a bit underwhelming in this run. It improved HumanEval over Q4_K_M by ~1.8 points, but used 42 GB RAM vs 28 GB and was slower. It also scored lower than Q4_K_M on HellaSwag in this eval.

For local/CPU deployment, I would probably pick Q4_K_M unless the workload is heavily code-generation focused. For maximum quality, BF16 still wins.

Evaluation setup:

  • GGUF via llama-cpp-python
  • n_ctx: 32768
  • checkpointed evaluation
  • HumanEval, HellaSwag, and BFCL all completed
  • BFCL had 400 function calling samples
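The checkpointed-run part can be approximated with a simple resumable loop: append each finished sample to a JSONL file and skip already-completed IDs on restart. A minimal sketch, where the file name and `generate()` are placeholders rather than the actual Neo AI Engineer setup:

```python
# Sketch of a resumable (checkpointed) eval loop: results are appended
# to a JSONL file as each sample finishes, so an interrupted run can
# restart and skip already-scored samples. generate() stands in for the
# actual llama-cpp-python call (Llama(model_path=..., n_ctx=32768)).
import json
from pathlib import Path

def run_eval(samples, generate, ckpt_path="eval_checkpoint.jsonl"):
    ckpt = Path(ckpt_path)
    done = set()
    if ckpt.exists():
        with ckpt.open() as f:
            done = {json.loads(line)["id"] for line in f}
    with ckpt.open("a") as f:
        for sample in samples:
            if sample["id"] in done:
                continue  # already scored in a previous run
            output = generate(sample["prompt"])
            f.write(json.dumps({"id": sample["id"], "output": output}) + "\n")
            f.flush()  # make each sample durable immediately
    with ckpt.open() as f:
        return [json.loads(line) for line in f]
```

On a clean run this behaves like a normal loop; after a crash, rerunning the same command only generates the missing samples.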

This evaluation was done using Neo AI Engineer, which built the GGUF eval setup, handled checkpointed runs, and consolidated the benchmark results. I manually reviewed the outcome as well.

The complete case study with benchmarking results, approach, and code snippets is in the comments below 👇

submitted by /u/gvij