Just wanted to share because it took me a lot of tweaking to get here:
llama-server -hf unsloth/MiniMax-M2.7-GGUF:UD-IQ3_XXS --temp 1.0 --top-k 40 --top-p 0.95 --host 0.0.0.0 --port 8080 -c 100000 -fa on -ngl 999 --no-context-shift -fit off --no-mmap -np 2 --kv-unified --cache-ram 0 -b 1024 -ub 1024 --cache-reuse 256
Reasoning behind the various options
--no-context-shift: I want to know when I run out of context instead of silently corrupting stuff.
--no-mmap: Recommended by Donato.
-np 2: Retain context for up to two concurrent sessions.
--kv-unified: Make the two sessions share the same KV cache to save VRAM.
--cache-ram 0: Disable the prompt cache in system RAM, so the cache stays in VRAM instead. This solved a lot of OOMs for me.
-b 1024 -ub 1024: Set the logical and physical batch sizes to 1024 to improve prefill performance.
--cache-reuse 256: Attempt to reuse cache "smartly". This sometimes avoids having to reprocess the cache but sometimes hurts, so use at your own discretion.
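If the one-liner gets hard to tweak, the same flags can live in a small launcher script, grouped by purpose. This is only a sketch: every flag is copied from the command at the top of this post, while the grouping, the cmd variable, and the RUN switch are my own additions (not llama-server features). By default it just prints the command so you can check it first.

```shell
#!/bin/sh
# Sketch of a launcher script; flags are copied verbatim from the command in this post.
# RUN and cmd are hypothetical conveniences of this script, not llama-server options.

# Model and sampling
set -- -hf unsloth/MiniMax-M2.7-GGUF:UD-IQ3_XXS --temp 1.0 --top-k 40 --top-p 0.95
# Network
set -- "$@" --host 0.0.0.0 --port 8080
# Context size, flash attention, GPU offload, memory behaviour
set -- "$@" -c 100000 -fa on -ngl 999 --no-context-shift -fit off --no-mmap
# Parallel slots and KV cache handling
set -- "$@" -np 2 --kv-unified --cache-ram 0 --cache-reuse 256
# Batch sizes for prefill
set -- "$@" -b 1024 -ub 1024

cmd="llama-server $*"

if [ "${RUN:-0}" = 1 ]; then
    # Replace this shell with the server process
    exec llama-server "$@"
else
    # Dry run: show what would be executed
    echo "$cmd"
fi
```

Run it as is to print the command, or with RUN=1 in the environment to actually start the server.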
Additional setup
Headless Fedora Linux, set up according to Donato's setup guides (but sans-toolbox). I also recommend increasing your swap size and setting OOMScoreAdjust=500 in your systemd service file; otherwise you risk the OOM killer killing important processes if you do run out of RAM.
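One way to apply the OOMScoreAdjust setting without editing the unit file itself is a systemd drop-in. A minimal sketch, assuming your service is named llama-server.service (a hypothetical name, substitute your own). The script only writes the drop-in to the current directory and prints the commands to install it, so nothing on the system changes until you run those yourself.

```shell
#!/bin/sh
# Sketch: generate a systemd drop-in that raises the OOM score of the server,
# so the OOM killer prefers it over other processes.
# "llama-server.service" is an assumed unit name; adjust to your setup.
unit="llama-server.service"
dropin="./oom.conf"

cat > "$dropin" <<'EOF'
[Service]
OOMScoreAdjust=500
EOF

# Print the install steps instead of running them, since they need root.
echo "To install:"
echo "  sudo install -D -m 644 $dropin /etc/systemd/system/$unit.d/oom.conf"
echo "  sudo systemctl daemon-reload"
echo "  sudo systemctl restart $unit"
```

After restarting, `systemctl show llama-server.service -p OOMScoreAdjust` should report the new value, again assuming that unit name.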
Intelligence
I've found MiniMax to be great at coding but not necessarily as "well rounded" as Qwen3.6 27B. It's not as strong at coding architecture discussions or code review. Qwen may also be stronger at non-coding stuff.
Where MiniMax shines is in coding "intuition": it "just gets you". Where Qwen would take things too literally or fail to get the gist of things, MiniMax better understands "intent". It may also have more "knowledge" than Qwen 27B due to having more parameters.
Performance
https://preview.redd.it/695zwpa6660h1.png?width=1000&format=png&auto=webp&s=c4a584f1aa9e2e8c406f44194097f66ce86cce13
https://preview.redd.it/2ojq0ts7660h1.png?width=1000&format=png&auto=webp&s=029f583fb4344be00c3681cf3a24722cf59123c7