For instance, I have a single 3090 Ti and 128GB of DDR4 RAM, and I appreciate good speed (20+ t/s) and context size (100k+).
These are some of the options I'm weighing:
Qwen 3.5 27B
Qwen 3.5 35B MOE
Qwen coder 80B
Gemma 4 31B
Gemma 4 26B MOE
...and a whole lot more options.
I just want a good model overall that's smart; I'll mostly use it for coding. I value intelligence over all other metrics.
Here is what I have so far.
- I am thinking Q4 quant for the model weights, since that was deemed "optimal" a while ago (I believe even Apple said its mobile LLMs were quantized to about that level). But the real world is never that simple; confusingly, some people are saying UD IQ3_XXS is really good in their testing of the 31B Gemma 4 model.
- q8 for the KV cache, because with the last "attn-rot" PR merged into llama.cpp, the KLD was reportedly pretty much the same as F16 in their testing.
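For what it's worth, here's a rough sketch of how that combination maps to llama.cpp flags. The model filename, context size, and layer count are placeholders for illustration, not a recommendation; note that quantized KV cache in llama.cpp needs flash attention enabled.

```shell
# Placeholder model file; swap in whichever quant you settle on
# (e.g. a Q4_K_M or UD IQ3_XXS GGUF).
./llama-server \
  -m model-UD-IQ3_XXS.gguf \
  -c 100000 \
  -ngl 99 \
  -fa \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

`-ngl 99` just offloads as many layers as fit in the 24GB of VRAM; the rest spill to system RAM.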
Can anyone help a brother out?