Qwen 3.6 27B IQ4_XS – 22 t/s on RTX 5060 Ti 16GB, 24k ctx

Maybe this will be helpful for someone:
llama-server -m '/Qwen3.6-27B/Qwen3.6-27B-IQ4_XS.gguf' -ngl 999 -ctk q4_0 -ctv q4_0 -b 128 -ub 128 -c 24000

I can't run this model with higher-precision KV cache quants at context sizes above 8192.
With -ub and -b set to 256, the max I could fit was 16384 ctx.
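
For anyone wondering why the KV cache type matters so much here, a rough back-of-the-envelope sketch. The block sizes are the ggml storage formats (32 values per block, plus a 2-byte scale for the quantized types), so q4_0 takes roughly 28% of the space f16 does. The `ELEMS_PER_TOKEN` value is a made-up placeholder, not the real figure for this model; the actual number depends on how many layers keep a KV cache, the number of KV heads, and the head size, which you would have to read from the GGUF metadata. This only covers the KV cache itself. Weights and compute buffers (which is where -b and -ub should come in, as far as I understand) are on top.

```shell
#!/usr/bin/env bash
# Rough KV-cache size estimate. Not a measurement, just arithmetic.

# Bytes per 32-value block for each cache type.
kv_block_bytes() {
  case "$1" in
    f16)  echo 64 ;;  # 32 values x 2 bytes
    q8_0) echo 34 ;;  # 32 x 1 byte + 2-byte scale
    q4_0) echo 18 ;;  # 32 x 4 bits + 2-byte scale
    *)    echo "unknown cache type: $1" >&2; return 1 ;;
  esac
}

# kv_cache_mib <cache_type> <n_ctx> <elems_per_token>
# elems_per_token = 2 (K and V) x attention layers x KV heads x head dim.
# Result is rounded down to whole MiB.
kv_cache_mib() {
  local block_bytes
  block_bytes=$(kv_block_bytes "$1") || return 1
  echo $(( $2 * $3 * $block_bytes / 32 / 1024 / 1024 ))
}

# HYPOTHETICAL value for illustration only, not taken from this model.
ELEMS_PER_TOKEN=32768

for t in f16 q8_0 q4_0; do
  echo "$t: $(kv_cache_mib "$t" 24000 "$ELEMS_PER_TOKEN") MiB at 24000 ctx"
done
```

Whatever the real per-token number is, the ratio between the cache types stays the same, so going from q4_0 to q8_0 nearly doubles the cache and f16 is about 3.5x.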

The largest ctx size I can get is 24k. Disabling GNOME freed up an additional 300 MiB.

It's kinda nice, but I know that's not very useful in many cases.

This GPU loads 63/65 layers at this quant without a quantized context. But it's still q4, so I think that's good enough.

submitted by /u/BazzyIm