Used over a million tokens across three separate sessions testing Qwen 3.6 35B (the new Multi-Token Prediction version).

In my opinion, MTP models are a 100% game changer for local LLMs.

In terms of speed, I was getting roughly 1.5x the tok/sec of my previous tests.

The project was a test: building a full pygame project iteratively, step by step; a small mystery dungeon-style game. At first I set a 100-200k context, then raised it to 300k, with the KV cache at Q8_0 quantization. I use VSCodium and Roo. The idea was to see how far I could push the context window and gauge (by feel) whether a large context window on a multi-file project slows things down too much to be effective.

Model used: Qwen3.6-35B-A3B-UD-Q5_K_S (MTP version) - link

OS/Software: Ubuntu 24.04 - Vulkan - To use MTP I had to run the MTP prototype of the llama.cpp server (image: havenoammo/llama:vulkan-server)
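For anyone curious what a setup like this looks like, here is a rough sketch of how such a container might be launched. This is a hypothetical invocation, not the exact command I used: the flag names follow upstream llama.cpp's llama-server (`-c`, `--cache-type-k`, `--cache-type-v`, `-ngl`), and the model path, port, and device mappings are assumptions, so check the image's own docs before copying it.

```shell
# HYPOTHETICAL sketch of launching the Vulkan MTP server image.
# Expose the AMD GPU nodes so Vulkan works inside the container,
# mount a host directory containing the GGUF, and publish the API port.
docker run --rm \
  --device /dev/dri \
  -v ~/models:/models \
  -p 8080:8080 \
  havenoammo/llama:vulkan-server \
    -m /models/Qwen3.6-35B-A3B-UD-Q5_K_S.gguf \
    -c 300000 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    -ngl 99
# -c 300000        -> 300k context window
# --cache-type-*   -> quantize the KV cache to Q8_0
# -ngl 99          -> offload all layers to the GPU
```

The two `--cache-type` flags are what make a 300k window fit: an FP16 KV cache at that length would be roughly twice the size.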

My current context window is 300k, but I feel I can go even higher since my VRAM usage is 28.3 GB of 32 GB. 400k is likely viable.
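Since the KV cache grows linearly with context length, the "400k is viable" hunch can be sanity-checked with simple arithmetic. The weight-plus-overhead figure below is a hypothetical assumption (the post doesn't break down the 28.3 GB); the total and used VRAM numbers are from above.

```python
# Sanity-check the 400k-context estimate: the KV cache scales linearly
# with context length, so scale the cache portion and re-add the weights.

VRAM_TOTAL_GB = 32.0   # card capacity (from the post)
VRAM_USED_GB = 28.3    # observed usage at 300k context (from the post)
WEIGHTS_GB = 24.0      # HYPOTHETICAL: Q5_K_S weights + runtime overhead

kv_at_300k = VRAM_USED_GB - WEIGHTS_GB          # cache portion at 300k
kv_at_400k = kv_at_300k * (400_000 / 300_000)   # linear scaling to 400k
projected = WEIGHTS_GB + kv_at_400k

print(f"projected VRAM at 400k context: {projected:.1f} GB")
```

Under that assumed split, the projection lands under the 32 GB ceiling, which is consistent with 400k being worth a try.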

GPU: Asus Radeon R9700 AI Pro card

Just want to send my appreciation to the local LLM community and everyone responsible for enabling us to run these kinds of powerful models at home. Amazing when I think of where we were just a year ago. I am having a blast exploring all this tech, and every day I learn something new, it leaves me astounded.

submitted by /u/Jorlen
