Why do people care more about tokens/s in decoding?

What I've noticed while using local LLMs recently is that in most cases the bottleneck is not decoding but prompt processing.

If the prompt processing speed is usable, then decoding already exceeds 10 tokens per second in most setups (an agentic coding session typically starts with a prompt of around 15k tokens). Isn't that already faster than we can follow with our eyes?

I tried to use qwen3.6 27b, but it took more than 10 minutes to process a 64k prompt on my Mac mini, so I chose 35b a3b instead.
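As a back-of-envelope illustration of why prefill can dominate, here is a small sketch. The throughput numbers are assumptions derived from the figures above (a 64k prompt taking ~10 minutes implies roughly 107 tok/s prefill), not measurements:

```python
# Rough model of wall-clock time per request: prefill + decode.
# All numbers are illustrative assumptions, not benchmarks.

def total_time(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Total seconds = time to process the prompt + time to generate output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# ~64k-token prompt processed in ~10 minutes -> ~107 tok/s prefill throughput
prefill_tps = 64_000 / 600

prefill_s = 64_000 / prefill_tps  # 600 s of waiting before the first token
decode_s = 1_000 / 10             # 100 s to generate 1k tokens at 10 tok/s

print(f"prefill: {prefill_s:.0f}s, decode: {decode_s:.0f}s")
```

Even with a perfectly "usable" 10 tok/s decode rate, the user here spends six times longer waiting on prompt processing than on generation, which matches the experience described above.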

What am I missing? Is prompt processing speed improved by MTP or other methods?

Or is the bottleneck just fundamentally different on discrete GPU setups?

submitted by /u/Interesting-Print366