Help my llm isn’t llming

Long story short, for some reason the Q4 and Q6 quants seem to be taking the same amount of RAM on my MacBook Air M2 (16 GB), and generating at the same speed? I'm a beginner with little knowledge about this, and I hope some kind souls here can save me.

Here are some stats.

Models: unsloth Qwen3.5 9B UD-Q4_K_XL (5.97 GB) and unsloth Qwen3.5 9B Q6_K (7.46 GB)

Sampling settings (these, along with everything else, are llama.cpp defaults):

temp 0.8
top-k 40
top-p 0.95

I ran `sudo purge` every time before switching to the next model, closed every window except Terminal and Activity Monitor, and made sure there was no swapping.

The memory usage is in the pictures. The right one is the Activity Monitor window, where I circled "memory used."

For some additional data, here is the `llama_memory_breakdown_print` output for Q4 and Q6, both after running for about 2.5 minutes and generating roughly 1425 and 1380 tokens respectively (time × t/s, a rough estimate). I changed the format a bit to make it more understandable.

Q4:

| memory breakdown [MiB] | total | free | self | model | context | compute | unaccounted |
|---|---|---|---|---|---|---|---|
| MTL0 (Apple M2) | 12124 | 690 | 11433 | 5679 | 5178 | 575 | 0 |
| Host | | | 882 | 545 | 0 | 336 | |

Q6:

| memory breakdown [MiB] | total | free | self | model | context | compute | unaccounted |
|---|---|---|---|---|---|---|---|
| MTL0 (Apple M2) | 12124 | 477 | 11645 | 7102 | 4050 | 493 | 0 |
| Host | | | 1061 | 795 | 0 | 266 | |
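A quick sanity check on those numbers in Python: within each row, model + context + compute should equal "self", and free + self + unaccounted should equal the Metal total. All values are copied from the breakdowns above; the ±2 MiB tolerance is my assumption about rounding in llama.cpp's output, and the comment about "total" being the device's working-set budget is my reading of how llama.cpp reports Metal memory, not something stated in the post.

```python
# Values copied from the llama_memory_breakdown_print output above (MiB).
q4 = dict(total=12124, free=690, self=11433, model=5679, context=5178, compute=575, unaccounted=0)
q6 = dict(total=12124, free=477, self=11645, model=7102, context=4050, compute=493, unaccounted=0)

for name, row in (("Q4", q4), ("Q6", q6)):
    # model + context + compute should equal "self" (up to MiB rounding)
    assert abs(row["model"] + row["context"] + row["compute"] - row["self"]) <= 2, name
    # free + self + unaccounted should equal the Metal total (up to MiB rounding)
    assert abs(row["free"] + row["self"] + row["unaccounted"] - row["total"]) <= 2, name

# The model weights DO differ between the two quants...
print(q6["model"] - q4["model"], "MiB more weights for Q6")  # 1423 MiB
# ...but the reported totals are identical: on Metal, "total" appears to be the
# device's recommended working-set budget rather than the model's footprint,
# so the free and context columns absorb the difference.
print(q4["total"] == q6["total"])  # True
```

In other words, the identical "total" column is not evidence that both quants use the same memory; the "model" column is where the Q4/Q6 difference actually shows up.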

submitted by /u/Nicking0413
