LiteRT-LM is a production-ready, open-source inference framework designed to deliver high-performance, cross-platform LLM deployments on edge devices.
## Key Features
- Cross-Platform Support: Run on Android, iOS, Web, and Desktop.
- Hardware Acceleration:
  - GPU: Powered by ML Drift, supporting both ML and Generative AI models.
  - NPU: Accelerated inference on devices with Qualcomm and MediaTek chipsets (Early Access).
- Multi-Modality: Vision and Audio input support.
- Tool Use: Function calling support for agentic workflows.
- Broad Model Support: Run Gemma, Llama, Phi-4, Qwen and more.
## Supported Backends & Platforms
| Platform | CPU Support | GPU Support | NPU Support |
|---|---|---|---|
| Android | ✅ | ✅ | ✅ |
| iOS | ✅ | ✅ | - |
| macOS | ✅ | ✅ | - |
| Windows | ✅ | ✅ | - |
| Linux | ✅ | ✅ | - |
| Embedded | ✅ | - | - |
## Quick Start
Want to try it out first? You can run LiteRT-LM immediately, before the full setup, using the pre-built binaries for desktop or the Google AI Edge Gallery app for mobile.
### Mobile Apps
The Google AI Edge Gallery is a demo app that puts the power of cutting-edge Generative AI models directly into your hands, powered by LiteRT-LM.
### Desktop CLI
After downloading the `lit` binary, run `lit` to see the available options.
## Choose Your Platform
| Language | Status | Best For... | Documentation |
|---|---|---|---|
| Kotlin | ✅ Stable | Native Android apps and JVM-based desktop tools. Optimized for Coroutines. | Kotlin API Reference |
| C++ | ✅ Stable | High-performance, cross-platform core logic and embedded systems. | C++ API Reference |
| Swift | 🚀 In Dev | Native iOS and macOS integration with specialized Metal support. | Coming Soon |
| Python | 🚀 In Dev | Rapid prototyping, development, and desktop-side scripting. | Coming Soon |
## Supported Models
The following table shows a sampling of models that are fully supported and tested with LiteRT-LM.
Note: "Chat Ready" indicates models tuned for chat (instruction tuning). "Base" models often require fine-tuning for optimal chat performance unless used for specific completions.
| Model | Type | Quantization | Context Length | Size (MB) | Download |
|---|---|---|---|---|---|
| Gemma | |||||
| Gemma3-1B | Chat Ready | 4-bit per-channel | 4096 | 557 | Download |
| Gemma-3n-E2B | Chat Ready | 4-bit per-channel | 4096 | 2965 | Download |
| Gemma-3n-E4B | Chat Ready | 4-bit per-channel | 4096 | 4235 | Download |
| FunctionGemma-270M | Base (Fine-tuning required) | 8-bit per-channel | 1024 | 288 | Fine-tuning Guide |
| ↪ TinyGarden-270M | Demo | 8-bit per-channel | 1024 | 288 | Download / Try App |
| Llama | |||||
| Llama-3.2-1B-Instruct | Chat Ready | 8-bit per-channel | 8192 | 1162 | Download |
| Llama-3.2-3B-Instruct | Chat Ready | 8-bit per-channel | 8192 | 2893 | Download |
| Phi | |||||
| phi-4-mini | Chat Ready | 8-bit per-channel | 4096 | 3728 | Download |
| Qwen | |||||
| qwen2.5-1.5b | Chat Ready | 8-bit per-channel | 4096 | 1524 | Download |
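As a rough sanity check on the sizes above, a quantized model's footprint can be estimated as parameter count × bits per weight. The sketch below uses this back-of-envelope formula; it ignores per-channel quantization scales, embedding tables, and file-format overhead, so real files run somewhat larger than the estimate.

```python
def estimate_size_mb(num_params: float, bits_per_weight: int) -> float:
    """Rough size estimate: parameters x bits per weight, in megabytes.

    Ignores quantization scales, embeddings, and container overhead,
    so actual model files are somewhat larger.
    """
    size_bytes = num_params * bits_per_weight / 8
    return size_bytes / (1024 * 1024)

# Gemma3-1B at 4-bit per-channel: ~1e9 params -> ~477 MB estimated,
# vs. 557 MB in the table (the gap is scales, embeddings, metadata).
print(round(estimate_size_mb(1e9, 4)))  # 477
```

The same arithmetic explains why the 8-bit models (Llama-3.2-1B at 1162 MB, qwen2.5-1.5b at 1524 MB) are roughly one byte per parameter.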
## Performance
Below are performance numbers for each model on various devices. Benchmarks are measured with 1024 prefill tokens and 256 decode tokens (with performance lock enabled on Android devices).
| Model | Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Context size |
|---|---|---|---|---|---|
| Gemma3-1B | MacBook Pro (2023 M3) | CPU | 423 | 67 | 4096 |
| Gemma3-1B | Samsung S24 (Ultra) | CPU | 243 | 44 | 4096 |
| Gemma3-1B | Samsung S24 (Ultra) | GPU | 1877 | 45 | 4096 |
| Gemma3-1B | Samsung S25 (Ultra) | NPU | 5837 | 85 | 1280 |
| Gemma-3n-E2B | MacBook Pro (2023 M3) | CPU | 233 | 28 | 4096 |
| Gemma-3n-E2B | Samsung S24 (Ultra) | CPU | 111 | 16 | 4096 |
| Gemma-3n-E2B | Samsung S24 (Ultra) | GPU | 816 | 16 | 4096 |
| Gemma-3n-E4B | MacBook Pro (2023 M3) | CPU | 170 | 20 | 4096 |
| Gemma-3n-E4B | Samsung S24 (Ultra) | CPU | 74 | 9 | 4096 |
| Gemma-3n-E4B | Samsung S24 (Ultra) | GPU | 548 | 9 | 4096 |
| FunctionGemma | Samsung S25 (Ultra) | CPU | 1718 | 126 | 1024 |
Note: The first time a given model is loaded on a given device, it will take longer to load as weights are optimized. Subsequent loads will be faster due to caching.
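Given the benchmark workload (1024 prefill tokens, 256 decode tokens), end-to-end time can be estimated by dividing each phase's token count by its measured rate, since prefill and decode run sequentially. A small sketch using the Gemma3-1B numbers from the table above:

```python
def benchmark_time_s(prefill_tokens: int, prefill_tps: float,
                     decode_tokens: int, decode_tps: float) -> float:
    """Estimated wall time: prefill and decode phases run back to back."""
    return prefill_tokens / prefill_tps + decode_tokens / decode_tps

# Gemma3-1B on Samsung S24 (Ultra), rates from the table above.
gpu = benchmark_time_s(1024, 1877, 256, 45)  # ~0.55 s + ~5.69 s
cpu = benchmark_time_s(1024, 243, 256, 44)   # ~4.21 s + ~5.82 s
print(f"GPU ~{gpu:.1f} s, CPU ~{cpu:.1f} s")  # GPU ~6.2 s, CPU ~10.0 s
```

Note that decode time dominates at this workload, which is why the GPU's large prefill advantage (1877 vs. 243 tokens/sec) narrows to a smaller end-to-end gap when its decode rate (45 tokens/sec) is close to the CPU's (44 tokens/sec).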
## Model Hosting and Deployment
When a model exceeds the "over-the-air" download limits (often around 1.5GB), a remote fetch strategy is required.
- Firebase: Recommended for downloading large files on Android and iOS.
- HuggingFace API: Fetch models directly using the HuggingFace API.
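When fetching from HuggingFace directly, files in a model repository resolve to URLs of the form `https://huggingface.co/<repo_id>/resolve/<revision>/<filename>`. A minimal sketch that builds such a URL; the repo id and filename below are placeholders for illustration, not real hosted paths:

```python
from urllib.parse import quote

def hf_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a HuggingFace file-resolve URL for a model file.

    The returned URL can be passed to any HTTP client that supports
    resumable downloads, which matters for multi-gigabyte model files.
    """
    return (f"https://huggingface.co/{repo_id}/resolve/"
            f"{quote(revision)}/{quote(filename)}")

# Hypothetical repo id and filename, for illustration only.
url = hf_resolve_url("some-org/some-model", "model.litertlm")
print(url)
```

For production apps, prefer the official `huggingface_hub` client (or Firebase, per the recommendation above), which adds authentication, caching, and resume support on top of this URL scheme.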
## Reporting Issues
If you encounter a bug or have a feature request, please use the LiteRT-LM GitHub Issues page.