I've been developing an agentic AI app for the past few months, and during the process decided that I needed a coherent frontier-class LLM running on a mobile device. It's been a long and frustrating journey, but I finally made the breakthrough tonight: stable 1.5 t/s on an iPhone Air using a fully decomposed Qwen35-397B-A17B model at Q4. That 1.5 t/s should go up over the coming days as I optimize expert usage per layer and strip out the tons of logging in each run. I regularly hit faster speeds of 5.4 t/s, but the response degrades to "the capital of France is the capital of France," so I focused first on getting correct responses through accurate expert selection. I'm also going to expand the test questions to verify coherence across topics and context lengths. The same model runs at ~11 t/s on my Air M5 (16GB RAM).

The hard part was getting the model decomposed and streaming in a way that balanced response speed, data loaded per token, and output quality. For those who don't know, the iPhone Air and 17 Pros have an ~8.2GB Metal kernel memory limit, and fitting the core components of an LLM into that space at usable speeds was incredibly difficult.

I'm not quite ready to tell the world how this was done (there are a lot of implications to having a frontier-class model in everyone's pocket), but I'm more than happy to talk about some of the approaches it took to get here. It was not easy, and I'm grateful to the llama.cpp people for creating a fantastic jumping-off point, but nothing out there is built for true streaming inference (ggml has way too many micro-operations), so there was a lot to discover and a lot of custom work to do. What I do want everyone to know is that the future of inference is at the edge, not the data center. Free tokens for all...
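For anyone curious what "accurate expert selection" means in a MoE model like this: each layer has a router that picks a few experts per token, and only those experts' weights need to be resident, which is what makes per-token streaming viable under a tight memory cap (at Q4, roughly 0.5 bytes/param, even just the active parameters of a ~17B-active MoE are in the same ballpark as that ~8.2GB limit). This is NOT my implementation, just a minimal toy sketch of standard top-k routing so the idea is concrete; all names here are made up:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a small list of logits.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route_token(hidden, router_rows, experts, top_k=2):
    """Toy MoE routing for a single token.

    hidden      : token hidden state, list of floats
    router_rows : router_rows[i] = router weight vector for expert i
    experts     : list of callables, expert i maps hidden -> new hidden

    Only the top_k selected experts ever run (and thus only their
    weights would need to be loaded/streamed for this token).
    """
    # Router logits: one score per expert.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in router_rows]
    # Indices of the top_k highest-scoring experts.
    top = sorted(range(len(logits)), key=logits.__getitem__)[-top_k:]
    # Renormalize gates over just the chosen experts.
    gate = softmax([logits[i] for i in top])
    # Weighted mix of the selected experts' outputs.
    out = [0.0] * len(hidden)
    for g, i in zip(gate, top):
        for j, v in enumerate(experts[i](hidden)):
            out[j] += g * v
    return out, top
```

If routing picks the wrong experts (or you cap/reuse experts too aggressively to save loads per token), you get exactly the kind of degenerate output described above, which is why speed and quality trade off here.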
Since I'm sure there are a ton of people who are going to try and call BS, I'm more than happy to talk through some of the technical details or show a demo video of this running in real time.