Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation

Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python, using Neo AI Engineer to build and run the evals.

Benchmarks used:

  • HumanEval: code generation
  • HellaSwag: commonsense reasoning
  • BFCL: function calling

Total samples:

  • HumanEval: 164
  • HellaSwag: 100
  • BFCL: 400
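HumanEval scores by functional correctness: a sample counts as passed only if the generated function clears the task's unit tests, and pass@1 is simply passed/total. A minimal sketch of that check (this is not the official harness, which sandboxes execution, and the sample task is made up):

```python
# Minimal sketch of HumanEval-style pass@1 scoring: execute the model's
# completion, then run the task's unit tests against it. NOT the official
# harness (which sandboxes execution); the example task is hypothetical.

def check_sample(completion: str, test_code: str, entry_point: str) -> bool:
    """Return True if the completion passes the task's tests."""
    namespace: dict = {}
    try:
        exec(completion, namespace)   # define the candidate function
        exec(test_code, namespace)    # defines a check(fn) test helper
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False

# Hypothetical example task
completion = "def add(a, b):\n    return a + b\n"
tests = "def check(fn):\n    assert fn(2, 3) == 5\n    assert fn(-1, 1) == 0\n"

print(check_sample(completion, tests, "add"))  # True

# pass@1 over the benchmark is then passed / total, e.g. 92/164
print(f"{92 / 164:.2%}")  # 56.10%
```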

Results:

BF16

  • HumanEval: 56.10% (92/164)
  • HellaSwag: 90.00% (90/100)
  • BFCL: 63.25% (253/400)
  • Avg accuracy: 69.78%
  • Throughput: 15.5 tok/s
  • Peak RAM: 54 GB
  • Model size: 53.8 GB

Q4_K_M

  • HumanEval: 50.61% (83/164)
  • HellaSwag: 86.00% (86/100)
  • BFCL: 63.00% (252/400)
  • Avg accuracy: 66.54%
  • Throughput: 22.5 tok/s
  • Peak RAM: 28 GB
  • Model size: 16.8 GB

Q8_0

  • HumanEval: 52.44% (86/164)
  • HellaSwag: 83.00% (83/100)
  • BFCL: 63.00% (252/400)
  • Avg accuracy: 66.15%
  • Throughput: 18.0 tok/s
  • Peak RAM: 42 GB
  • Model size: 28.6 GB
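The "Avg accuracy" figures above are unweighted macro averages of the three benchmark percentages (not weighted by sample count, so BFCL's 400 samples count the same as HellaSwag's 100). A quick sketch to reproduce them:

```python
# Reproduce the "Avg accuracy" figures: unweighted mean of the three
# benchmark scores. A sample-weighted (micro) average would differ,
# since BFCL has 400 samples vs HellaSwag's 100.

scores = {
    "BF16":   [56.10, 90.00, 63.25],  # HumanEval, HellaSwag, BFCL
    "Q4_K_M": [50.61, 86.00, 63.00],
    "Q8_0":   [52.44, 83.00, 63.00],
}

for variant, s in scores.items():
    print(f"{variant}: {sum(s) / len(s):.2f}%")
# BF16: 69.78%
# Q4_K_M: 66.54%
# Q8_0: 66.15%
```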

What stood out:

Q4_K_M looks like the best practical variant here. It keeps BFCL almost identical to BF16, drops about 5.5 points on HumanEval, and is still only 4 points behind BF16 on HellaSwag.

The tradeoff is pretty good:

  • 1.45x faster than BF16
  • 48% less peak RAM
  • 68.8% smaller model file
  • nearly identical function calling score
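Those tradeoff numbers fall straight out of the raw measurements:

```python
# Derive the Q4_K_M-vs-BF16 tradeoff figures from the measurements above.
bf16 = {"tok_s": 15.5, "ram_gb": 54, "size_gb": 53.8}
q4   = {"tok_s": 22.5, "ram_gb": 28, "size_gb": 16.8}

speedup   = q4["tok_s"] / bf16["tok_s"]          # ~1.45x faster
ram_drop  = 1 - q4["ram_gb"] / bf16["ram_gb"]    # ~48% less peak RAM
size_drop = 1 - q4["size_gb"] / bf16["size_gb"]  # ~68.8% smaller file

print(f"{speedup:.2f}x faster, {ram_drop:.0%} less RAM, {size_drop:.1%} smaller")
# 1.45x faster, 48% less RAM, 68.8% smaller
```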

Q8_0 was a bit underwhelming in this run. It improved HumanEval over Q4_K_M by ~1.8 points, but used 42 GB RAM vs 28 GB and was slower. It also scored lower than Q4_K_M on HellaSwag in this eval.

For local/CPU deployment, I would probably pick Q4_K_M unless the workload is heavily code-generation focused. For maximum quality, BF16 still wins.

Evaluation setup:

  • GGUF via llama-cpp-python
  • n_ctx: 32768
  • checkpointed evaluation
  • HumanEval, HellaSwag, and BFCL all completed
  • BFCL had 400 function calling samples
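The checkpointed-run part can be approximated with a simple resumable loop: append each finished sample to a JSONL file and skip already-completed IDs on restart. A minimal sketch, where the file name and `generate()` are placeholders rather than the actual Neo AI Engineer setup:

```python
# Sketch of a resumable (checkpointed) eval loop: results are appended
# to a JSONL file as each sample finishes, so an interrupted run can
# restart and skip already-scored samples. generate() stands in for the
# actual llama-cpp-python call (Llama(model_path=..., n_ctx=32768)).
import json
from pathlib import Path

def run_eval(samples, generate, ckpt_path="eval_checkpoint.jsonl"):
    ckpt = Path(ckpt_path)
    done = set()
    if ckpt.exists():
        with ckpt.open() as f:
            done = {json.loads(line)["id"] for line in f}
    with ckpt.open("a") as f:
        for sample in samples:
            if sample["id"] in done:
                continue  # already scored in a previous run
            output = generate(sample["prompt"])
            f.write(json.dumps({"id": sample["id"], "output": output}) + "\n")
            f.flush()  # make each sample durable immediately
    with ckpt.open() as f:
        return [json.loads(line) for line in f]
```

On a clean run this behaves like a normal loop; after a crash, rerunning the same command only generates the missing samples.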

This evaluation was done using Neo AI Engineer, which built the GGUF eval setup, handled checkpointed runs, and consolidated the benchmark results. I manually reviewed the outcome as well.

The complete case study with benchmarking results, approach, and code snippets is in the comments below 👇

submitted by /u/gvij