Hybrid on-device inference on Android: llama.cpp + LiteRT + NPU/GPU routing

Hi everyone,

I’m the maintainer of Box, a fork of Google’s AI Edge Gallery that I’ve been extending into a fully offline AI assistant for Android.

Full disclosure: I built this project.

It runs entirely on-device (no cloud, no accounts, no external inference), and combines multiple local inference backends in a single app.


What I’ve been experimenting with

The goal was to see how far a fully offline mobile AI stack could be pushed using:

  • llama.cpp (GGUF LLM inference)
  • whisper.cpp (on-device STT)
  • stable-diffusion.cpp (image generation)
  • LiteRT (Google’s on-device runtime)

All running on Android with hardware acceleration where available (GPU / NPU / TPU).
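Roughly, each modality maps to its own engine, with LiteRT handling models shipped in Google's .task/.tflite formats rather than GGUF. A simplified Kotlin sketch of that dispatch (the enums and function are illustrative, not the app's real API):

```kotlin
// Hypothetical sketch of how the engines could be composed; names are
// illustrative, not the actual Box / AI Edge Gallery code.
enum class Task { CHAT, TRANSCRIBE, GENERATE_IMAGE }
enum class Engine { LLAMA_CPP, WHISPER_CPP, STABLE_DIFFUSION_CPP, LITERT }

// Each modality goes to a dedicated native backend; chat splits on model
// format, since LiteRT and llama.cpp load different file types.
fun engineFor(task: Task, isLiteRtModel: Boolean = false): Engine = when {
    task == Task.CHAT && isLiteRtModel -> Engine.LITERT
    task == Task.CHAT -> Engine.LLAMA_CPP
    task == Task.TRANSCRIBE -> Engine.WHISPER_CPP
    else -> Engine.STABLE_DIFFUSION_CPP
}
```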


Current capabilities

  • Voice-to-voice conversation (streaming style, hands-free loop)
  • Vision + voice (live camera frame + natural language Q&A)
  • On-device image generation (Stable Diffusion via GGUF)
  • Document ingestion into context (local files)
  • Custom GGUF model import
  • Runs across CPU / GPU / NPU / TPU (auto-selected)

Architecture focus

What I’ve found interesting while building this:

  • LiteRT + llama.cpp hybrid inference works better than expected on newer Snapdragon/Pixel NPUs
  • Model routing matters more than raw model size on mobile
  • whisper.cpp is still the most stable STT layer for fully offline setups
  • In many cases, memory and persistence become the real bottleneck before compute does
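To make the routing point concrete, here is a minimal sketch of the kind of per-model backend selection I mean. The capability model, thresholds, and names are illustrative assumptions, not the actual Box code:

```kotlin
// Hypothetical backend-routing sketch; not the app's real logic.
enum class Backend { CPU, GPU, NPU }

data class DeviceCaps(
    val hasNpu: Boolean,        // e.g. Snapdragon HTP or Pixel Tensor
    val hasGpuDelegate: Boolean,
    val freeRamMb: Int
)

// Prefer the NPU for quantized graphs, then the GPU delegate, then CPU.
// The memory check runs first: loading a model near the RAM limit tends
// to stall or kill the app before any accelerator matters.
fun routeBackend(caps: DeviceCaps, modelIsQuantized: Boolean, modelSizeMb: Int): Backend {
    if (modelSizeMb > caps.freeRamMb / 2) return Backend.CPU // mmap + CPU is the safe fallback
    if (caps.hasNpu && modelIsQuantized) return Backend.NPU
    if (caps.hasGpuDelegate) return Backend.GPU
    return Backend.CPU
}
```

This is also why I say routing matters more than raw model size: a mid-sized model on the right accelerator usually beats a larger one that falls back to CPU.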

Repo (for reference)

https://github.com/jegly/Box


Why I’m posting this here

I’m mainly sharing this for feedback from people also working on local inference systems, especially around:

  • mobile quantization strategies
  • hybrid runtime routing (CPU/GPU/NPU)
  • multimodal on-device pipelines
  • performance tuning on constrained hardware
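On the quantization point, the heuristic I have in mind is roughly "pick the largest quant that still fits in RAM with headroom." A rough sketch, where the bytes-per-weight figures are ballpark approximations for common llama.cpp K-quants and the 30% headroom factor is an illustrative guess (real budgets depend on KV cache and context length):

```kotlin
// Hypothetical quant-selection sketch; numbers are approximate, not measured.
fun pickQuant(freeRamMb: Int, paramsB: Double): String {
    // Approximate bytes per weight for common llama.cpp GGUF quant formats,
    // ordered from highest quality to smallest footprint.
    val bytesPerWeight = linkedMapOf(
        "Q8_0" to 1.06,
        "Q5_K_M" to 0.68,
        "Q4_K_M" to 0.58,
        "Q3_K_M" to 0.44
    )
    for ((quant, bpw) in bytesPerWeight) {
        val modelMb = paramsB * 1000 * bpw        // params in billions -> MB on disk
        if (modelMb * 1.3 < freeRamMb) return quant // ~30% headroom for KV cache etc.
    }
    return "too large for this device"
}
```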

Not trying to push adoption; I’m more interested in technical critique than anything else.


Happy to answer questions or go deeper into any part of the stack if useful.

submitted by /u/Healthy_Bedroom5837
