I’m trying to find the best llama-server launch command / runtime config for running Qwen3.6 27B GGUF with full GPU offload on ROCm.
I’m currently using the IQ4_XS quant, but I’m not sure if that’s the best option for my setup. This is on Ubuntu, with the display connected to my iGPU, so the RX 7800 XT should have no display overhead. I only have 16 GB DDR4 RAM, which is why I haven’t tried the 35B MoE model.
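Rough math on whether this even fits, assuming IQ4_XS averages about 4.25 bits per weight (that figure is my approximation):

```bash
# weights-only VRAM estimate: 27B params at ~4.25 bits/weight (IQ4_XS, approx.)
echo "scale=1; 27 * 4.25 / 8" | bc   # ≈ 14.3 GB, leaving ~1-2 GB for KV cache and buffers on the 16 GB card
```

That is part of why I'm unsure IQ4_XS is the right trade-off at 64K context, even with q4_0 KV cache.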
My goal is to optimize for agentic use (OpenClaw, Hermes Agent, and similar): capability, token generation speed, usable context length, and reliability.
Current command:
```bash
GPU_MAX_HEAP_SIZE=100 \
GPU_MAX_ALLOC_PERCENT=100 \
./build/bin/llama-server \
  -m /home/guy/.cache/huggingface/hub/models--bartowski--Qwen_Qwen3.6-27B-GGUF/snapshots/f73b625d7ceedbd05d14a93874387cd3bcd673b7/Qwen_Qwen3.6-27B-IQ4_XS.gguf \
  -ngl 999 \
  -c 65536 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --parallel 1 \
  --prio 2 \
  --fit off \
  --no-mmap \
  -b 65536 \
  -ub 512 \
  --reasoning-format deepseek \
  --temp 0.6 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0 \
  --presence-penalty 1.5 \
  --repeat-penalty 1.0 \
  -n 32768 \
  --no-context-shift
```
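Once it's running I smoke-test it with something like this (llama-server's default bind is 127.0.0.1:8080, since I don't pass --host/--port):

```bash
# health check, then a quick completion against the OpenAI-compatible endpoint
curl -s http://127.0.0.1:8080/health
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"ping"}],"max_tokens":16}'
```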