I'm running the 122-billion-parameter Qwen 3.5, and I'm (very!) impressed with its general knowledge. I can talk to it in multiple languages, and I don't feel the need to consult online frontier models for encyclopaedic, general "handyman", or other day-to-day questions; my local Qwen seems sufficient. That said, the output seems slow, at around 19 tokens/s. Is this speed expected? I'm running the model with llama-server (compiled from the latest source as of yesterday), and the chat UI is Open WebUI. Are there any speed optimizations I can make in this setup without compromising output quality?