A year ago I could only read about the 397B league of models. Today I can run one on my laptop. The combination of the importance matrix (imatrix) with Unsloth's per-model adaptive layer quantization is what makes it all possible. But I didn't start with 397B; I started with 17 smaller models.
There was a lot of great feedback on the "M5 Max 128GB, 17 models, 23 prompts: Qwen 3.5 122B is still a local king" discussion.
I used Gemma 4 to organize all the feedback into actions, and together we created a list to work through to address the feedback and the asks: https://github.com/tolitius/cupel/issues/1
One of the asks was to take "Qwen3.5-397B-A17B-UD-IQ2_XXS" for a spin on the M5 Max 128G MacBook. These Unsloth ("UD") models are really interesting because different layers are quantized differently. On top of that, the most important ("I") weights are rounded to minimize their loss / error.
After downloading Qwen 397B, before doing anything else I wanted to understand what it is I am going to ask my laptop to swallow:
```
$ ll -h ~/.llama.cpp/models/Qwen3.5-397B-A17B-UD-IQ2_XXS/UD-IQ2_XXS
total 224361224
-rw-r--r--  1 user  staff    10M Apr 12 18:50 Qwen3.5-397B-A17B-UD-IQ2_XXS-00001-of-00004.gguf
-rw-r--r--  1 user  staff    46G Apr 12 20:12 Qwen3.5-397B-A17B-UD-IQ2_XXS-00003-of-00004.gguf
-rw-r--r--  1 user  staff    14G Apr 12 20:57 Qwen3.5-397B-A17B-UD-IQ2_XXS-00004-of-00004.gguf
-rw-r--r--  1 user  staff    46G Apr 12 21:12 Qwen3.5-397B-A17B-UD-IQ2_XXS-00002-of-00004.gguf
```
Now I knew it is 106GB. The original 16-bit model is 807GB. If it were "just" quantized uniformly to 2 bits, it would take (397B * 2 bits) / 8 = ~99GB, but I am looking at 106GB, so I wanted to look under the hood at the actual quantization recipe the Unsloth team followed:
```
$ gguf-dump \
    ~/.llama.cpp/models/Qwen3.5-397B-A17B-UD-IQ2_XXS/UD-IQ2_XXS/Qwen3.5-397B-A17B-UD-IQ2_XXS-00002-of-00004.gguf \
    2>&1 | head -200
```
| Tensor type | Quant | Bits | Role |
|---|---|---|---|
| ffn_gate_exps | IQ2_XXS | ~2.06 | 512 routed experts gate (bulk of model) |
| ffn_up_exps | IQ2_XXS | ~2.06 | 512 routed experts up (bulk of model) |
| ffn_down_exps | IQ2_S | ~2.31 | 512 routed experts down (one step higher) |
| ffn_gate_shexp | Q5_K | 5.5 | shared expert gate |
| ffn_up_shexp | Q5_K | 5.5 | shared expert up |
| ffn_down_shexp | Q6_K | 6.56 | shared expert down |
| attn_gate / attn_qkv | Q5_K | 5.5 | GatedDeltaNet attention (linear attn layers) |
| attn_q / attn_k / attn_v / attn_output | Q5_K | 5.5 | full attention layers (every 4th) |
| ssm_out | Q6_K | 6.56 | GatedDeltaNet output (most sensitive) |
| ssm_alpha / ssm_beta | Q8_0 | 8.0 | GatedDeltaNet gates |
| ssm_conv1d / ssm_a / ssm_dt / ssm_norm | F32 | 32 | small tensors, kept full precision |
| ffn_gate_inp (router) | F32 | 32 | MoE router weights |
| token_embd / output | Q4_K | 4.5 | embedding and lm_head |
| norms | F32 | 32 | all normalization weights |
super interesting. the expert tensors (ffn_gate_exps, ffn_up_exps and ffn_down_exps) are quantized at ~2 bits, but everything else is kept at much higher precision. This is where the 7GB difference (99GB vs. 106GB) comes from: 7GB of packed intelligence on top of the expert tensors.
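The back-of-envelope math above can be checked in a couple of lines (a sketch: treats "GB" as 10^9 bytes and 397B as exactly 397e9 parameters, both of which are rounded in practice):

```python
# flat 2-bit size vs. the actual 106GB of gguf shards on disk
params = 397e9                            # total parameters
flat_2bit_gb = params * 2 / 8 / 1e9       # uniform 2-bit quantization
actual_gb = 106                           # what the 4 shards add up to

# average bits per weight implied by the mixed-precision recipe
effective_bits = actual_gb * 1e9 * 8 / params

print(f"flat 2-bit:     {flat_2bit_gb:.1f} GB")            # ~99.2 GB
print(f"actual:         {actual_gb} GB")
print(f"effective bits: {effective_bits:.2f} bits/weight")  # ~2.14
```

So the "IQ2_XXS" label describes the bulk expert tensors, while the model as a whole averages closer to ~2.14 bits per weight.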
trial by fire
By trial and error I found that a 16K context is the sweet spot for the 128GB of unified memory, but the GPU wired memory limit needs to be raised a little to fit it (it is around 96GB by default):
$ sudo sysctl iogpu.wired_limit_mb=122880
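Where does 122880 come from? A quick sketch of the budget, assuming the sysctl value is in MiB (which is how the 120GiB figure works out) and that the rest of the 128GiB stays with macOS:

```python
# iogpu.wired_limit_mb value: 120 GiB for the GPU, expressed in MiB
total_gib = 128   # unified memory on the M5 Max
gpu_gib   = 120   # what we hand to the GPU

wired_limit_mb = gpu_gib * 1024
print(wired_limit_mb)                               # 122880
print(total_gib - gpu_gib, "GiB left for the OS")   # 8 GiB left for the OS
```

That leaves ~8GiB of headroom, enough for the 106GB model plus the 16K context to fit under the limit without starving the system.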
"llama.cpp" would be the best choice to run this model (since MLX does not support IQ2_XXS quantization):
```
$ llama-server \
    -m ~/.llama.cpp/models/Qwen3.5-397B-A17B-UD-IQ2_XXS/UD-IQ2_XXS/Qwen3.5-397B-A17B-UD-IQ2_XXS-00001-of-00004.gguf \
    --n-gpu-layers 99 \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --top-k 20
```
My current use case, as I described in the previous reddit discussion, is finding the best model assembly to help me make sense of my kids' school work and progress. If anything is super messy in terms of organization, variety of disconnected systems where the kids' data lives, and communication inconsistencies, it is US public schools. A small army of Claude Sonnets does it well'ish, but it is really expensive, hence the hope that "Qwen3.5 397B" could be a drop-in replacement.
In order to make sense of which local models "do good" I used cupel: https://github.com/tolitius/cupel, and that is the next step: fire it up and test "Qwen3.5 397B" on multi-turn, tool use, etc. tasks:
https://preview.redd.it/hoy0uqr75yug1.png?width=2476&format=png&auto=webp&s=0caab1625168f52c74244175843644a600edcf28
And, after all the tests, I found "Qwen3.5 397B IQ2" to be.. amazing. Even at 2 bits it is extremely intelligent: it can call tools, pass context between turns, organize a very messy set of tables into clean aggregates, etc.
It is on par with "Qwen 3.5 122B 4bit", but I suspect I need to work on more exquisite prompts to distill the difference.
What surprised me the most is the 29 tokens per second average generation speed:
```
prompt eval time =     269.46 ms /    33 tokens (    8.17 ms per token,   122.46 tokens per second)
       eval time =   79785.85 ms /  2458 tokens (   32.46 ms per token,    30.81 tokens per second)
      total time =   80055.31 ms /  2491 tokens
slot release: id  1 | task 7953 | stop processing: n_tokens = 2490, truncated = 0
srv  update_slots: all slots are idle
```
this is one of the examples from "llama.cpp". the prompt processing speed depends on batching and ranged from 80 tokens per second to 330 tokens per second
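The reported generation speed is just tokens divided by wall time; recomputing it from the numbers in the log above (pure arithmetic, nothing model-specific):

```python
# recompute tokens/second from the llama.cpp eval timing line
eval_ms     = 79785.85   # wall time of the generation phase
eval_tokens = 2458       # tokens generated

ms_per_token = eval_ms / eval_tokens
tok_per_sec  = eval_tokens / (eval_ms / 1000)

print(f"{ms_per_token:.2f} ms per token")       # 32.46 ms per token
print(f"{tok_per_sec:.2f} tokens per second")   # 30.81 tokens per second
```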
The disadvantages I can see so far:
- Can't really run it efficiently in the assembly, since it is the only model that fits in memory. With the 122B (65GB) I can still run more models side by side
- I don't expect it to handle large context well due to hardware memory limitation
- Theoretically it would have a worse time dealing with very specialized knowledge where a specific expert is needed, but its weights are "too crushed" to give a clean answer. But, just maybe, the "I" in "IQ2_XXS" makes sure that the important weights stay very close to their original values
- Under load I saw the speed dropping from 30 to 17 tokens per second. I suspect it is caused by the prompt cache filling up and triggering evictions, but this needs more research
But.. 512 experts, 397B of stored knowledge, 17B active parameters per token and all that at 29 tokens per second on a laptop.