**What I used**

**Installation**

I downloaded the "Termux" app from the Google Play Store and installed the needed tools in Termux:

**Downloading a model**

I downloaded Qwen3.5-0.8B-Q5_K_M.gguf in my phone browser and saved it to my device. Then I opened the download folder shortcut in the browser, selected the GGUF file -> open with: Termux. Now the file is accessible in Termux.

**Running it in the terminal**

After that, I loaded the model and started chatting through the command line.

**Running it in the browser**

I also tried running the model with llama-server, which gives a more readable UI in your web browser while Termux keeps running in the background. To do this, run the command below to start a local server, then open it in the browser by typing localhost:8080 or 127.0.0.1:8080 in the address bar.

With the previous command I had only achieved 3-4 TPS; just by adding the parameter "-t 6", which dedicates 6 CPU threads to inference, output increased to 7-8 TPS. This shows there is real potential to increase generation speed by tuning parameters.

**Conclusion**

Running an open-source LLM on my phone like this was a fun experience, especially considering it is a 2021 device, so newer phones should offer an even smoother experience. This is by no means a guide on how to do it best, as I have only done surface-level testing. There are various parameters that can be adjusted, depending on your device, to increase TPS and achieve a more optimal setup. Maybe this has motivated you to try this on your phone, and I hope you find some of this helpful!
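The exact commands didn't survive in the post, so as a rough sketch of the steps described above (assuming llama.cpp is built from source inside Termux, and that the model file was moved to the Termux home directory — the package names and paths here are my assumptions, not the author's exact setup):

```shell
# Install build tools inside Termux (package names assumed)
pkg update && pkg install git cmake clang

# Build llama.cpp from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Chat in the terminal (model path assumed)
./build/bin/llama-cli -m ~/Qwen3.5-0.8B-Q5_K_M.gguf

# Or serve the web UI at http://localhost:8080,
# with "-t 6" dedicating 6 CPU threads to inference
./build/bin/llama-server -m ~/Qwen3.5-0.8B-Q5_K_M.gguf --port 8080 -t 6
```

With the server running, opening localhost:8080 in the phone's browser should show llama.cpp's built-in chat UI.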