It’s crazy how many great models and techniques we have now — picking the right combination of model, weight quant, and KV-cache quant for my system has turned into a genuine optimization problem.

For instance, I have a single 3090 Ti and 128GB of DDR4 RAM, and I care about good speed (20+ t/s) and context size (100k+).

Just from the latest releases, I have these options:

Qwen 3.5 27B

Qwen 3.5 35B MOE

Qwen coder 80B

Gemma 4 31B

Gemma 4 26B MOE

...and a whole lot more options.

I just want a model that's smart and good all-around; I'll mostly be using it for coding.

I value intelligence over all other metrics.

Here is what I have so far.

- I am thinking Q4 quant for the model weights, since that was deemed "optimal" a while ago (I believe even Apple said its mobile LLMs were around this level). But the real world is never that simple: confusingly, some people are saying UD IQ3_XXS holds up really well in their testing of the 31B Gemma 4 model.

- q8 for the KV cache, because after the last "attn-rot" PR was merged into llama.cpp, the KLD was reportedly pretty much the same as F16 in their testing.
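To sanity-check whether a given combo even fits in 24 GB alongside 100k of context, a rough memory estimate helps. Here's a back-of-envelope sketch; the bits-per-weight figures are approximate community numbers, and the 48-layer / 8-KV-head / 128-dim config is an assumption for illustration, not the spec of any model above:

```python
# Rough VRAM estimate for quantized weights + KV cache on a 24 GB card.
# Assumed bpw values (approximate, not measured): Q4_K_M ~4.8, IQ3_XXS ~3.1.

def weights_gib(params_b: float, bpw: float) -> float:
    """Approximate weight size in GiB for params_b billion parameters."""
    return params_b * 1e9 * bpw / 8 / 2**30

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx: int, bits: int) -> float:
    """Approximate K+V cache size in GiB (the factor 2 covers K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bits / 8 / 2**30

# A 27B dense model at Q4_K_M-ish vs IQ3_XXS-ish quant:
print(round(weights_gib(27, 4.8), 1))   # ~15.1 GiB
print(round(weights_gib(27, 3.1), 1))   # ~9.7 GiB

# 100k context with the assumed 48-layer / 8-KV-head / 128-dim config,
# q8 (8-bit) vs F16 (16-bit) cache:
print(round(kv_cache_gib(48, 8, 128, 100_000, 8), 1))   # ~9.2 GiB
print(round(kv_cache_gib(48, 8, 128, 100_000, 16), 1))  # ~18.3 GiB
```

Under these assumptions, Q4 weights plus a q8 cache at 100k context is already tight on 24 GB, which is part of why the IQ3 quants get suggested. In llama.cpp you'd select the cache type with `-ctk q8_0 -ctv q8_0` (last I checked, quantizing the V cache also requires `--flash-attn`).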

Can anyone help a brother out?

submitted by /u/takuonline