Qwen 3.6 27B IQ4_XS: 22 t/s on RTX 5060 Ti 16GB, 24k ctx

Maybe this will be helpful for someone:
llama-server -m '/Qwen3.6-27B/Qwen3.6-27B-IQ4_XS.gguf' -ngl 999 -ctk q4_0 -ctv q4_0 -b 128 -ub 128 -c 24000

I can't run this model with higher-precision KV cache quants at context sizes above 8192.
With -ub and -b set to 256, the maximum context I could fit was 16384.
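A rough way to see why the KV cache type matters so much is to estimate the cache size directly. The sketch below assumes the standard llama.cpp block layouts (f16 at 2 bytes per value; q8_0 at 34 bytes and q4_0 at 18 bytes per 32-value block) and a plain attention model where every layer keeps K and V. The layer count, KV head count, and head dimension are placeholder values for illustration only, not verified figures for Qwen3.6-27B.

```python
# Bytes per 32-value block for the llama.cpp cache types discussed above.
# f16 = 2 bytes/value; q8_0 and q4_0 each add one fp16 scale per block.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}
BLOCK_SIZE = 32


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Approximate K+V cache size in bytes (ignores padding and non-attention layers)."""
    values_per_token = n_layers * n_kv_heads * head_dim  # for K alone (same for V)
    blocks = values_per_token // BLOCK_SIZE
    return 2 * n_ctx * blocks * BLOCK_BYTES[cache_type]  # 2 = K and V


# PLACEHOLDER architecture values, NOT confirmed for Qwen3.6-27B.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128

for t in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(24000, N_LAYERS, N_KV_HEADS, HEAD_DIM, t) / 2**30
    print(f"{t:5s} {gib:.2f} GiB")
```

With these placeholder values, a 24000-token cache comes out to roughly 5.86 GiB at f16, 3.11 GiB at q8_0, and 1.65 GiB at q4_0, which is the kind of gap that decides whether the context fits next to a ~15 GB model on a 16 GB card.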

The maximum context size I can get is 24k. Disabling GNOME freed an additional 300 MiB of VRAM.
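To get a feel for how much context a small amount of freed VRAM buys, you can divide it by the per-token KV cache cost. This is a minimal sketch assuming the llama.cpp block sizes for f16/q8_0/q4_0 and the same hypothetical architecture values as a plain attention model (64 layers, 8 KV heads, head dim 128); these are illustrative, not confirmed for Qwen3.6-27B. It also ignores compute buffers, which may grow with context too, so treat the result as an upper bound.

```python
# Bytes per 32-value block in llama.cpp cache types.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}
BLOCK_SIZE = 32


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, cache_type):
    """K+V bytes added to the cache by one extra token of context."""
    blocks = n_layers * n_kv_heads * head_dim // BLOCK_SIZE
    return 2 * blocks * BLOCK_BYTES[cache_type]


def extra_ctx(freed_bytes, n_layers, n_kv_heads, head_dim, cache_type):
    """Upper bound on extra context tokens the freed VRAM could hold."""
    return freed_bytes // kv_bytes_per_token(n_layers, n_kv_heads, head_dim, cache_type)


freed = 300 * 2**20  # 300 MiB, the amount freed by disabling GNOME in the post
# Hypothetical architecture values, NOT verified for Qwen3.6-27B.
print(extra_ctx(freed, 64, 8, 128, "q4_0"))
```

Under these assumptions each token costs 72 KiB at q4_0, so 300 MiB is worth at most about 4.2k extra tokens of context.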

It's kind of nice, but I know it's not very useful in many cases.

Without quantizing the context (KV cache), this GPU can load only 63 of 65 layers with this quant. But it's still Q4, so I think that's good enough.

I used the Unsloth quant: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF?show_file_info=Qwen3.6-27B-IQ4_XS.gguf

submitted by /u/BazzyIm
