Hybrid on-device inference on Android: llama.cpp + LiteRT + NPU/GPU routing

Hi everyone,

I’m the maintainer of Box, a fork of Google’s AI Edge Gallery that I’ve been extending into a fully offline AI assistant for Android.

Full disclosure: I built this project.

It runs entirely on-device (no cloud, no accounts, no external inference), and combines multiple local inference backends in a single app.


What I’ve been experimenting with

The goal was to see how far a fully offline mobile AI stack could be pushed using:

  • llama.cpp (GGUF LLM inference)
  • whisper.cpp (on-device STT)
  • stable-diffusion.cpp (image generation)
  • LiteRT (Google’s on-device runtime)

All running on Android with hardware acceleration where available (GPU / NPU / TPU).
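Roughly, each modality maps to its own engine, with LiteRT handling models shipped in Google's .task/.tflite formats rather than GGUF. A simplified Kotlin sketch of that dispatch (the enums and function are illustrative, not the app's real API):

```kotlin
// Hypothetical sketch of how the engines could be composed; names are
// illustrative, not the actual Box / AI Edge Gallery code.
enum class Task { CHAT, TRANSCRIBE, GENERATE_IMAGE }
enum class Engine { LLAMA_CPP, WHISPER_CPP, STABLE_DIFFUSION_CPP, LITERT }

// Each modality goes to a dedicated native backend; chat splits on model
// format, since LiteRT and llama.cpp load different file types.
fun engineFor(task: Task, isLiteRtModel: Boolean = false): Engine = when {
    task == Task.CHAT && isLiteRtModel -> Engine.LITERT
    task == Task.CHAT -> Engine.LLAMA_CPP
    task == Task.TRANSCRIBE -> Engine.WHISPER_CPP
    else -> Engine.STABLE_DIFFUSION_CPP
}
```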


Current capabilities

  • Voice-to-voice conversation (streaming style, hands-free loop)
  • Vision + voice (live camera frame + natural language Q&A)
  • On-device image generation (Stable Diffusion via GGUF)
  • Document ingestion into context (local files)
  • Custom GGUF model import
  • Runs across CPU / GPU / NPU / TPU (auto-selected)

Architecture focus

What I’ve found interesting while building this:

  • LiteRT + llama.cpp hybrid inference works better than expected on newer Snapdragon/Pixel NPUs
  • Model routing matters more than raw model size on mobile
  • whisper.cpp is still the most stable STT layer for fully offline setups
  • In many cases, memory and persistence become the real bottleneck before compute does
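To make the routing point concrete, here is a minimal sketch of the kind of per-model backend selection I mean. The capability model, thresholds, and names are illustrative assumptions, not the actual Box code:

```kotlin
// Hypothetical backend-routing sketch; not the app's real logic.
enum class Backend { CPU, GPU, NPU }

data class DeviceCaps(
    val hasNpu: Boolean,        // e.g. Snapdragon HTP or Pixel Tensor
    val hasGpuDelegate: Boolean,
    val freeRamMb: Int
)

// Prefer the NPU for quantized graphs, then the GPU delegate, then CPU.
// The memory check runs first: loading a model near the RAM limit tends
// to stall or kill the app before any accelerator matters.
fun routeBackend(caps: DeviceCaps, modelIsQuantized: Boolean, modelSizeMb: Int): Backend {
    if (modelSizeMb > caps.freeRamMb / 2) return Backend.CPU // mmap + CPU is the safe fallback
    if (caps.hasNpu && modelIsQuantized) return Backend.NPU
    if (caps.hasGpuDelegate) return Backend.GPU
    return Backend.CPU
}
```

This is also why I say routing matters more than raw model size: a mid-sized model on the right accelerator usually beats a larger one that falls back to CPU.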

Repo (for reference)

https://github.com/jegly/Box


Why I’m posting this here

I’m mainly sharing this for feedback from people also working on local inference systems, especially around:

  • mobile quantization strategies
  • hybrid runtime routing (CPU/GPU/NPU)
  • multimodal on-device pipelines
  • performance tuning on constrained hardware
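On the quantization point, the heuristic I have in mind is roughly "pick the largest quant that still fits in RAM with headroom." A rough sketch, where the bytes-per-weight figures are ballpark approximations for common llama.cpp K-quants and the 30% headroom factor is an illustrative guess (real budgets depend on KV cache and context length):

```kotlin
// Hypothetical quant-selection sketch; numbers are approximate, not measured.
fun pickQuant(freeRamMb: Int, paramsB: Double): String {
    // Approximate bytes per weight for common llama.cpp GGUF quant formats,
    // ordered from highest quality to smallest footprint.
    val bytesPerWeight = linkedMapOf(
        "Q8_0" to 1.06,
        "Q5_K_M" to 0.68,
        "Q4_K_M" to 0.58,
        "Q3_K_M" to 0.44
    )
    for ((quant, bpw) in bytesPerWeight) {
        val modelMb = paramsB * 1000 * bpw        // params in billions -> MB on disk
        if (modelMb * 1.3 < freeRamMb) return quant // ~30% headroom for KV cache etc.
    }
    return "too large for this device"
}
```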

Not trying to push adoption; I’m more interested in technical critique than anything else.


Happy to answer questions or go deeper into any part of the stack if useful.

submitted by /u/Healthy_Bedroom5837
