Preface: I actually write my posts myself; there is no slop in this post.
I managed to get Qwen3.5 35B-A3B running on my 15" 16GB M3 MacBook Air through mmap, and I must say that, given how large the model is compared to my RAM, 9 tokens/s is not bad at all.
So, how did I do it?
Step one, download the model itself:
pip3 install huggingface-hub
python3 -c "from huggingface_hub import hf_hub_download; \
hf_hub_download('unsloth/Qwen3.5-35B-A3B-GGUF', \
'Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf', \
local_dir='~/.local/share/llama-models')"
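If you want to sanity-check that the download finished, a quick size check (same path as in the command above) looks like this:
python3 -c "import os; p = os.path.expanduser('~/.local/share/llama-models/Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf'); print(p, round(os.path.getsize(p) / 2**30, 1), 'GiB')"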
After it has been downloaded, run it through this command:
llama-server \
--model PATH_TO_MODEL \
--port 8081 \
--ctx-size 4096 \
--n-gpu-layers 0 \
--parallel 1 \
--mmap \
--flash-attn on \
--threads 6 \
--batch-size 512 \
--ubatch-size 128 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--no-warmup
Note: you do not have to quantize the K/V cache to q4_0; I set it so that, for less serious work, the cache eats less of my precious memory.
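To get a feel for how much that saves, here is a rough back-of-the-envelope sketch; the layer/head/dim numbers are placeholders, not the actual Qwen3.5 35B-A3B config, so plug in the values from your GGUF's metadata:
python3 - <<'EOF'
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes per element.
# The architecture numbers below are made-up placeholders, not the real model config.
n_layers, n_kv_heads, head_dim, ctx = 48, 8, 128, 4096

def kv_bytes(bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

print("f16 cache:  %.2f GiB" % (kv_bytes(2) / 2**30))
print("q4_0 cache: %.2f GiB" % (kv_bytes(0.5625) / 2**30))  # q4_0 is roughly 4.5 bits per element
EOF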
The key here is mmap: instead of loading the whole model into RAM up front, the weights file is memory-mapped, so pages are pulled in from the SSD on demand and dropped again when memory runs tight. That is what allows me to run it in the first place.
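If you want to see the idea in isolation, here is a tiny illustration using Python's mmap module (this is not what llama.cpp does internally line-for-line, and the path is the same one from the download step):
python3 - <<'EOF'
import mmap, os

# Map the GGUF file into virtual memory; nothing is read from disk yet.
path = os.path.expanduser('~/.local/share/llama-models/Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf')
with open(path, 'rb') as f, mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as m:
    print('mapped', len(m) // 2**20, 'MiB without loading them into RAM')
    # Only the pages we actually touch get paged in from the SSD.
    print('first bytes:', m[:4])  # the GGUF magic
EOF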
Finally, use the model through either the API or the llama.cpp WebUI!
API: http://127.0.0.1:8081/v1/
WebUI: http://127.0.0.1:8081
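For example, a minimal chat request against the OpenAI-compatible endpoint (using the requests package; the model field is largely ignored since llama-server only serves the model it loaded):
python3 - <<'EOF'
import requests

resp = requests.post(
    "http://127.0.0.1:8081/v1/chat/completions",
    json={
        "model": "qwen",  # placeholder; the server answers with whatever model it has loaded
        "messages": [{"role": "user", "content": "Say hi in five words."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
EOF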
If anyone better versed in llama.cpp can suggest tweaks for squeezing out more TPS, please let me know; these are just the settings I tried and found to work pretty well.