Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,…)

The following is a non-comprehensive test I came up with to measure the quality difference (a.k.a. degradation) between different quantizations of Qwen 3.6 27B. I want to figure out which quant is best to run on my 16 GB VRAM setup.

WHAT WE ARE TESTING

First, the prompt:

Given this PGN string of a chess game: 1. b3 e5 2. Nf3 h5 3. d4 exd4 4. Nxd4 Nf6 5. f4 Ke7 6. Qd3 d5 7. h4 * Figure out the current state of the chessboard, create an image in SVG code, also highlight the last move. 

I want to see if the models can:

  • Track the state of the board after each move to reach the final state (the first half of move 7)
  • Generate the right SVG image of the board, correctly placing the pieces and highlighting the last move

And yes, in case you are wondering: it's possible that the model was trained to do exactly this on existing chess games, so I came up with some random moves, the kind that no player above 300 Elo would ever have played.
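The board-tracking half of the task is easy to verify outside of any model. Below is a minimal sketch that replays the game on a plain dictionary board; note that I resolved the SAN moves to from/to squares by hand (there is no SAN parser here), so the coordinate list is my own translation of the PGN:

```python
# Replay the PGN as hand-resolved from->to coordinate moves and print the
# final position. Uppercase = White, lowercase = Black, "." = empty square.
def start_position():
    board, back = {}, "RNBQKBNR"
    for i, f in enumerate("abcdefgh"):
        board[f + "1"] = back[i]          # White back rank
        board[f + "2"] = "P"              # White pawns
        board[f + "7"] = "p"              # Black pawns
        board[f + "8"] = back[i].lower()  # Black back rank
    return board

# 1. b3 e5 2. Nf3 h5 3. d4 exd4 4. Nxd4 Nf6 5. f4 Ke7 6. Qd3 d5 7. h4
MOVES = [("b2", "b3"), ("e7", "e5"), ("g1", "f3"), ("h7", "h5"),
         ("d2", "d4"), ("e5", "d4"), ("f3", "d4"), ("g8", "f6"),
         ("f2", "f4"), ("e8", "e7"), ("d1", "d3"), ("d7", "d5"),
         ("h2", "h4")]

board = start_position()
for frm, to in MOVES:
    board[to] = board.pop(frm)  # a capture simply overwrites the target

for rank in "87654321":  # rank 8 first, i.e. the board as seen by White
    print("".join(board.get(f + rank, ".") for f in "abcdefgh"))
```

The printout should match the screenshot below: white knight on d4, white pawns on b3, f4, and h4, black king on e7, black pawns on d5 and h5.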

For those who are not chess players, this is how the board is supposed to look after 7. h4. By the way, judge the piece positions and the board orientation, not the image quality, because this is just a screenshot from Lichess.

https://preview.redd.it/6lsfvzy8wfzg1.png?width=1586&format=png&auto=webp&s=94634b461528a6ecc6728eefd23072ab28c3769d
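As a reference for the orientation checks below: with White at the bottom, a8 (top-left) and h1 (bottom-right) are light squares, and a1 (bottom-left) is dark. Here is a minimal sketch of the square grid the models are expected to emit (the colors are Lichess-style placeholders, not something the prompt requires):

```python
# Generate the 64 board squares as SVG rects with correct orientation:
# row 0 is rank 8 (top of the image), col 0 is file a (left edge).
SIZE = 60
LIGHT, DARK = "#f0d9b5", "#b58863"  # assumed Lichess-like colors

rects = []
for row in range(8):
    for col in range(8):
        fill = LIGHT if (row + col) % 2 == 0 else DARK
        rects.append(f'<rect x="{col * SIZE}" y="{row * SIZE}" '
                     f'width="{SIZE}" height="{SIZE}" fill="{fill}"/>')

svg = (f'<svg xmlns="http://www.w3.org/2000/svg" '
       f'width="{8 * SIZE}" height="{8 * SIZE}">' + "".join(rects) + "</svg>")
print(svg[:80])
```

Getting `(row + col) % 2` backwards, or iterating the ranks in the wrong direction, produces exactly the flipped-board and broken-checkerboard failures shown in the results.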

CAN OTHER MODELS SOLVE IT?

Before we get to the main part, let me show the results from some other models. I find it interesting that not many models were able to figure out the board state, let alone render it correctly.

Qwen 3.5 27B

It mostly figured out the final position of the pieces, but still rendered the original board state on top. It also highlighted the wrong squares, and the board orientation is wrong.

https://preview.redd.it/oanbebp9xfzg1.png?width=1078&format=png&auto=webp&s=b72af75a10f4a9f4d897699b404580370bd29d9e

Gemma 4 31B

Nice chess.com flagship board style. I would say it figured out the board state, but it failed to render it correctly; the square pattern is also messed up.

https://preview.redd.it/w5jwi05nxfzg1.png?width=1640&format=png&auto=webp&s=33e6f21f56c4e98df92c828103ac10714e578973

Qwen3 Coder Next

I don't know what to say, quite disappointed.

https://preview.redd.it/knltp8h1yfzg1.png?width=1348&format=png&auto=webp&s=1e9207cd1dfd08b049eaa13727703be732d2cb96

Qwen3.6 35B A3B

As expected, the 35B is always the fastest Qwen model, but at the same time it managed to fail the task successfully in many different ways. This is why I decided to find a way to squeeze the 27B onto my 16 GB card; the speed alone is just not worth it.

https://preview.redd.it/orti5kdhyfzg1.png?width=3360&format=png&auto=webp&s=c29a3aae9683e5ceaa15c59ae32adecabdd1b6b6

HOW DOES QWEN3.6 27B SOLVE IT?

All the models here are tested with the same set of llama.cpp parameters:

  • temp 0.6
  • top-p 0.95
  • top-k 20
  • min-p 0.0
  • presence_penalty 1.0
  • context window 65536

The BF16 version was run on OpenRouter, the Q8_0 to Q4_K_XL versions on an L40S server, and the rest on my RTX 5060 Ti.

The SVG code was generated directly in the llama.cpp Web UI without any tools or MCP enabled. (I originally ran this test in the Pi agent, only to find out that the model tried to peek into the parent folders, found the existing SVG diagrams from higher quants, and copied most of them.)

BF16 - Full precision

This is the baseline of the test. It has everything I needed: the right positions, the right board orientation, the right piece colors, the right highlight. The dotted blue line was unexpected, but also interesting, because as you will see later, not many of the higher quants generate it.

https://preview.redd.it/lgizkjklzfzg1.png?width=1424&format=png&auto=webp&s=d7867b55735d3d875e0e36aecbaf3c3f0d1dbd58

Q8_0

As expected, Q8_0 retains pretty much everything from full precision, except the line.

https://preview.redd.it/6wjnq6ff0gzg1.png?width=1610&format=png&auto=webp&s=f0d20ff4717b972efffced49ac8d43075fa97eb5

Q6_K

We start to see some quality loss here, namely the placement of the rank 5 pawns. The different look of the pieces is mostly because Q6_K decided to use a different font; none of the quants in this test tried to draw their own pieces.

https://preview.redd.it/kcqj81vl0gzg1.png?width=1608&format=png&auto=webp&s=66c7a219e79a8f6ecf44e27489f337b4016185b5

Q5_K_XL

Looks very similar to Q8_0, but it is worth noting that the Q5 version's SVG code is 7.1 KB, while Q8_0's is 4.7 KB.

https://preview.redd.it/6wshu7g01gzg1.png?width=1506&format=png&auto=webp&s=289db354fea59c456d8bd2dc7abdbcc1e4282ffd

Q4_K_XL and IQ4_XS

If you ignore the font choice, you will see that Q4_K_XL is the more complete solution, since it includes the board coordinates.

https://preview.redd.it/pzdghdtm1gzg1.png?width=3326&format=png&auto=webp&s=10c3d7758459f223d195107353f1ec76565cd31d

Q3_K_XL and Q3_K_M

https://preview.redd.it/56gttur62gzg1.png?width=3330&format=png&auto=webp&s=4af27d8a652e2deef6c14485d0fff4bd3651097f

IQ3_XXS

Now here's the interesting part: everything was mostly correct, the piece placement and the highlight, and there's even the line on the last move!

But IQ3_XXS gets the board orientation wrong; see the light square on the bottom left?

https://preview.redd.it/7jnzxy324gzg1.png?width=1608&format=png&auto=webp&s=178f72f51e65866497f16e861b04c0c448fce774

Q2_K_XL

This is just a waste of time. But hey, it got all the piece positions right; the board is just not aligned at all.

https://preview.redd.it/3z63d7bv4gzg1.png?width=1604&format=png&auto=webp&s=f6723b28248327c55bede4e42a4a0cfbe962fb74

SO, WHAT DO I USE?

I know a single test is not enough to draw any conclusions here. But personally, I will never go for anything below IQ4_XS after this test (I had a bad experience with Q3_K_XL and below in other tries).

On my RTX 5060 Ti, I got around pp 100 t/s and tg 8 t/s for IQ4_XS with vanilla llama.cpp (q8 for both ctk and ctv, fit on). But with TheTom's TurboQuant fork, I managed to get up to pp 760 t/s and tg 22 t/s by forcing GPU offload of all layers (`-ngl 99`), which is quite usable.

llama-cpp-turboquant/build/bin/llama-server -fa 1 -c 75000 -np 1 --no-mmap --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence_penalty 1.0 -ctk turbo4 -ctv turbo2 -ub 128 -b 256 -m Qwen3.6-27B-IQ4_XS.gguf -ngl 99 

The only downside is that I have to keep the context window below 75k and use turbo4/turbo2 for the KV cache quantization.
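That 75k cap lines up with a back-of-the-envelope KV cache budget. Here is a rough estimator sketch; the layer/head/dim numbers below are placeholders I picked for illustration, not the actual Qwen 3.6 27B config (read the real values from the GGUF metadata), and the per-block overhead of the quant formats is ignored:

```python
# Rough KV-cache size: elements per token for K plus V, times bits per
# element, times context length. Ignores quant-format block overhead.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bits_k, bits_v):
    per_token = n_layers * n_kv_heads * head_dim * (bits_k + bits_v) / 8
    return int(per_token * ctx)

# Placeholder config: 48 layers, 8 KV heads, head_dim 128, 75k context,
# 4-bit K cache (turbo4) and 2-bit V cache (turbo2)
size = kv_cache_bytes(48, 8, 128, 75_000, 4, 2)
print(f"{size / 2**30:.2f} GiB")  # prints 2.57 GiB
```

With unquantized 16-bit K/V, the same placeholder config would need about 13.7 GiB of cache at 75k context, which is why the KV cache quant matters so much on a 16 GB card.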

Below are some examples of different KV cache quants.

https://preview.redd.it/y0y7o6h09gzg1.png?width=3320&format=png&auto=webp&s=bd7c855100ff63c9bb666a4f4a61b966ad6eebca

https://preview.redd.it/dyrru7z19gzg1.png?width=3314&format=png&auto=webp&s=d54238d7a31c6cd8858f84df67ff588dc22d726b

You can see all the results directly here: https://qwen3-6-27b-benchmark.vercel.app/

submitted by /u/bobaburger