Using 100k context with 3090 with MTP GGUF and getting 50 t/s on llama.cpp
Thought I would knowledge share
Use https://huggingface.co/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF
And llama.cpp built from am17an's commit with MTP support
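If you haven't built that branch yet, the standard llama.cpp CUDA build should work on it. Note the fork URL below is my guess at where am17an's work lives; check the actual PR for the right repo and branch:

```shell
# Hypothetical clone target -- verify the real fork/branch from am17an's PR
git clone https://github.com/am17an/llama.cpp llama-cpp-am17an
cd llama-cpp-am17an

# Standard llama.cpp CUDA build (produces build/bin/llama-server)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```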
/media/adam/D_DRIVE/LLM/llama-cpp-am17an/build/bin/llama-server \
-m "/media/Qwen3.6-27B-Q4/Qwen3.6-27B-MTP-Q4_K_M.gguf" \
--ctx-size 100000 \
-ngl 99 -fa on \
--cache-type-k q4_0 --cache-type-v q4_0 \
--batch-size 2048 --ubatch-size 1024 \
--spec-type mtp --spec-draft-n-max 2
Note: Spec draft 3 seemed too much for the 3090 at higher context
Why 100k context? Beyond that it slows down, and 100k is enough for most tasks; then compact and continue.
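For intuition on why the q4_0 KV cache is what lets 100k fit alongside the weights on a 24 GB card, here's a back-of-envelope estimate. The layer count and head dimensions below are placeholder assumptions for illustration, not the real Qwen3.6-27B config:

```shell
#!/bin/sh
# Rough KV-cache size estimate. LAYERS / KV_HEADS / HEAD_DIM are
# made-up example values, NOT Qwen3.6-27B's actual architecture.
CTX=100000      # matches --ctx-size
LAYERS=48       # assumed transformer layer count
KV_HEADS=8      # assumed GQA key/value heads
HEAD_DIM=128    # assumed per-head dimension

# Per token: K and V (factor of 2) * layers * kv_heads * head_dim elements.
# f16 = 2 bytes/element; q4_0 = 18 bytes per 32-element block = 9/16 bytes/element.
F16_MB=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * CTX * 2 / 1024 / 1024 ))
Q4_MB=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * CTX * 9 / 16 / 1024 / 1024 ))
echo "f16  KV cache: ${F16_MB} MiB"
echo "q4_0 KV cache: ${Q4_MB} MiB"
```

With these example dims the f16 cache alone would blow the VRAM budget, while q4_0 cuts it by roughly 3.5x, which is the headroom that makes 100k viable next to a Q4_K_M model.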