The options I see online seem to make the model slower

These are the options I'm currently using. Setting parallel to 1, using more draft tokens, or adding draft-min-P at 0.75 doesn't seem to improve anything. I have a 5090 and I'm running inside Docker. Right now it runs at 100 tok/s, and when I modify these options it falls to around 80. What am I doing wrong?

- "-m"
- "/models/Qwen3.6-27B-UD-Q4_K_XL.gguf"
- "--n-gpu-layers"
- "999"
- "--ctx-size"
- "162144"
- "--spec-type"
- "draft-mtp"
- "--spec-draft-n-max"
- "2"
- "--parallel"
- "1"
- "--cache-type-k"
- "q8_0"
- "--cache-type-v"
- "q8_0"
- "--flash-attn"
- "on"
- "--batch-size"
- "2048"
- "--cont-batching"
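
For anyone who wants to reproduce the comparison, here is a rough sketch of how the variants could be tracked: change one option at a time on top of a baseline and express the result as a percent change in tok/s. The flag names and values are copied from the list above. The helper names `with_overrides` and `percent_change` are made up for illustration, and nothing here checks that the server actually accepts a given flag.

```python
# Baseline taken from the post; only the options being varied are listed.
# These strings are not validated against the server binary.
BASE_ARGS = {
    "--spec-type": "draft-mtp",
    "--spec-draft-n-max": "2",
    "--parallel": "1",
}


def with_overrides(base, overrides):
    """Return a flat argv-style list: base options with overrides applied."""
    merged = {**base, **overrides}
    argv = []
    for flag, value in merged.items():
        argv.append(flag)
        argv.append(value)
    return argv


def percent_change(baseline_tps, candidate_tps):
    """Percent change in throughput relative to the baseline (negative = slower)."""
    return (candidate_tps - baseline_tps) * 100.0 / baseline_tps


if __name__ == "__main__":
    # Hypothetical variant: raise the draft length, leave everything else alone.
    variant = with_overrides(BASE_ARGS, {"--spec-draft-n-max": "4"})
    print(variant)
    # Numbers from the post: 100 tok/s before, about 80 tok/s after.
    print(f"{percent_change(100.0, 80.0):.1f}%")
```

Changing a single option per run makes it easier to see which one is responsible for the drop from 100 to around 80 tok/s.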

submitted by /u/InternalMode8159