The main reason is that quantization quality directly affects a model's performance and stability, and therefore its real-world usefulness. Even though GRM-2.6-Plus beats Qwen3.6 27B (the model it derives from) in benchmarks, it gives worse results than an AutoRound Q2_K_mixed quant of Qwen3.6 27B that is practically the same size. This is just one example: most of the quants I tested suffer from the same problems, and only a few of them, mostly ones using a different quantization mechanism, are usable below Q5.

I want to advocate for AutoRound quantization as the standard for lower quants (Q1-Q4). Apex also performed quite well, though its files are larger. Maybe you know of other alternative methods that give consistent results, because standard quants like Q4_K_M don't provide adequate results and often produce buggy behavior overall (looping, hallucinations, inconsistency).

Prompt: Create an SVG image of a pelican riding a bicycle

Multiple examples of different quant results: https://www.reddit.com/r/LocalLLaMA/comments/1szp96f/comment/oj3r4b1/

AutoRound Q2_K_mixed: https://huggingface.co/sphaela/Qwen3.6-27B-AutoRound-GGUF

Regular llama.cpp Q4_K_M: https://huggingface.co/morikomorizz/GRM-2.6-Plus-GGUF

This is just one example, and the output quality is consistently worse when I ask tricky questions: the model hallucinates more, loops, and so on. The community should understand that typical quantization below Q5-Q6 is inadequate for Qwen models unless you tinker with it through some more intelligent mechanism like Intel's AutoRound. In my experience, looping is a direct symptom of broken quantization; occasional syntax errors in agentic coding are another.
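To illustrate why naive low-bit quants degrade so sharply, here is a toy sketch (plain Python, illustrative numbers only, not the actual GGUF or AutoRound implementation) of round-to-nearest (RTN) uniform quantization, the baseline that methods like AutoRound improve on by learning per-weight rounding decisions instead of always rounding to the nearest level:

```python
# Toy sketch: symmetric round-to-nearest (RTN) uniform quantization.
# Shows how reconstruction error grows as bit width shrinks; smarter
# schemes (e.g. AutoRound) tune the rounding to reduce this error.
def rtn_quantize(weights, bits):
    """Map floats onto 2**bits uniform levels and dequantize back."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax   # per-group scale factor
    return [round(w / scale) * scale for w in weights]

# Hypothetical weight group, just for demonstration.
weights = [0.31, -0.82, 0.05, 0.67, -0.44, 0.12]
for bits in (8, 4, 2):
    deq = rtn_quantize(weights, bits)
    mse = sum((w - q) ** 2 for w, q in zip(weights, deq)) / len(weights)
    print(f"{bits}-bit RTN  MSE = {mse:.5f}")
```

At 2 bits there are only three usable levels per group, so most weights collapse onto 0 or ±scale and the error explodes, which matches the looping and hallucination symptoms described above; AutoRound-style methods spend extra optimization at quantization time to place each weight on the less obvious but globally better level.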