LiteRT-LM Overview

LiteRT-LM is a production-ready, open-source inference framework designed to deliver high-performance, cross-platform LLM deployments on edge devices.

Key Features

  • Cross-Platform Support: Run on Android, iOS, Web, and Desktop.
  • Hardware Acceleration:
    • GPU: Powered by ML Drift, supporting both ML and Generative AI models.
    • NPU: Accelerated inference on devices with Qualcomm and MediaTek chipsets (Early Access).
  • Multi-Modality: Vision and Audio input support.
  • Tool Use: Function calling support for agentic workflows.
  • Broad Model Support: Run Gemma, Llama, Phi-4, Qwen and more.
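
The "Tool Use" feature above follows the usual function-calling pattern: the model emits a structured tool call, and the host application parses it and dispatches to real code. The sketch below illustrates that dispatch loop in plain Python; the tool names, the JSON shape, and `dispatch_tool_call` are illustrative assumptions, not part of the LiteRT-LM API.

```python
import json

# Hypothetical tool registry -- names and callables are illustrative,
# not LiteRT-LM APIs.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def dispatch_tool_call(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and invoke the matching tool.

    Expects output shaped like:
        {"name": "get_weather", "args": {"city": "Paris"}}
    """
    call = json.loads(model_output)
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return fn(**call["args"])

result = dispatch_tool_call('{"name": "get_weather", "args": {"city": "Paris"}}')
print(result)  # Sunny in Paris
```

In an agentic workflow, the returned string would be fed back to the model as the tool's result for the next turn.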

Supported Backends & Platforms

| Platform | CPU Support | GPU Support | NPU Support |
|----------|-------------|-------------|-------------|
| Android  | ✓ | ✓ | ✓ |
| iOS      | ✓ | ✓ | - |
| macOS    | ✓ | ✓ | - |
| Windows  | ✓ | ✓ | - |
| Linux    | ✓ | ✓ | - |
| Embedded | ✓ | - | - |

Quick Start

Want to try it out first? Before working through the full setup, you can run LiteRT-LM immediately using the pre-built binaries on desktop or the Google AI Edge Gallery app on mobile.

Mobile Apps

The Google AI Edge Gallery is a demo app, built on LiteRT-LM, that puts cutting-edge Generative AI models directly into your hands.

Desktop CLI

After downloading the lit binary, run lit with no arguments to see the available options.

Choose Your Platform

| Language | Status | Best For... | Documentation |
|----------|--------|-------------|---------------|
| Kotlin | Stable | Native Android apps and JVM-based desktop tools. Optimized for coroutines. | Kotlin API Reference |
| C++ | Stable | High-performance, cross-platform core logic and embedded systems. | C++ API Reference |
| Swift 🚀 | In Dev | Native iOS and macOS integration with specialized Metal support. | Coming Soon |
| Python 🚀 | In Dev | Rapid prototyping, development, and desktop-side scripting. | Coming Soon |

Supported Models

The following table shows a sampling of models that are fully supported and tested with LiteRT-LM.

Note: "Chat Ready" indicates models tuned for chat (instruction-tuned). "Base" models typically require fine-tuning for good chat performance unless they are used for plain completion tasks.

| Model | Type | Quantization | Context Length | Size (MB) | Download |
|-------|------|--------------|----------------|-----------|----------|
| **Gemma** | | | | | |
| Gemma3-1B | Chat Ready | 4-bit per-channel | 4096 | 557 | Download |
| Gemma-3n-E2B | Chat Ready | 4-bit per-channel | 4096 | 2965 | Download |
| Gemma-3n-E4B | Chat Ready | 4-bit per-channel | 4096 | 4235 | Download |
| FunctionGemma-270M | Base (fine-tuning required) | 8-bit per-channel | 1024 | 288 | Fine-tuning Guide |
| ↪ TinyGarden-270M | Demo | 8-bit per-channel | 1024 | 288 | Download / Try App |
| **Llama** | | | | | |
| Llama-3.2-1B-Instruct | Chat Ready | 8-bit per-channel | 8192 | 1162 | Download |
| Llama-3.2-3B-Instruct | Chat Ready | 8-bit per-channel | 8192 | 2893 | Download |
| **Phi** | | | | | |
| phi-4-mini | Chat Ready | 8-bit per-channel | 4096 | 3728 | Download |
| **Qwen** | | | | | |
| qwen2.5-1.5b | Chat Ready | 8-bit per-channel | 4096 | 1524 | Download |
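
The "Chat Ready" distinction matters at the prompt level: instruction-tuned checkpoints expect a turn-based template rather than raw text. As an example, Gemma's documented format wraps each turn in control tokens; the helper below is a minimal sketch of that template (the function name is illustrative, not a LiteRT-LM API).

```python
def format_gemma_chat(user_message: str) -> str:
    """Wrap a user message in Gemma's turn-based chat template.

    Instruction-tuned ("Chat Ready") checkpoints expect these control
    tokens; base models are prompted with raw text instead.
    """
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = format_gemma_chat("What is LiteRT-LM?")
```

The trailing `<start_of_turn>model` line cues the model to generate the assistant's reply.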

Performance

Below are the performance numbers for each model on various devices. Note that the benchmark is measured with a 1024-token prefill and a 256-token decode (with the performance lock enabled on Android devices).

| Model | Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Context size |
|-------|--------|---------|----------------------|----------------------|--------------|
| Gemma3-1B | MacBook Pro (2023 M3) | CPU | 423 | 67 | 4096 |
| Gemma3-1B | Samsung S24 (Ultra) | CPU | 243 | 44 | 4096 |
| Gemma3-1B | Samsung S24 (Ultra) | GPU | 1877 | 45 | 4096 |
| Gemma3-1B | Samsung S25 (Ultra) | NPU | 5837 | 85 | 1280 |
| Gemma-3n-E2B | MacBook Pro (2023 M3) | CPU | 233 | 28 | 4096 |
| Gemma-3n-E2B | Samsung S24 (Ultra) | CPU | 111 | 16 | 4096 |
| Gemma-3n-E2B | Samsung S24 (Ultra) | GPU | 816 | 16 | 4096 |
| Gemma-3n-E4B | MacBook Pro (2023 M3) | CPU | 170 | 20 | 4096 |
| Gemma-3n-E4B | Samsung S24 (Ultra) | CPU | 74 | 9 | 4096 |
| Gemma-3n-E4B | Samsung S24 (Ultra) | GPU | 548 | 9 | 4096 |
| FunctionGemma | Samsung S25 (Ultra) | CPU | 1718 | 126 | 1024 |
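
Throughput figures like those above come from dividing the number of tokens processed by wall-clock time for the prefill and decode phases separately. A minimal timing harness is sketched below; `measure_throughput` and the callable it times are illustrative assumptions, not a LiteRT-LM benchmarking API.

```python
import time

def measure_throughput(run_fn, num_tokens: int) -> float:
    """Time one inference phase (prefill or decode) and return tokens/sec.

    run_fn is any zero-argument callable that performs the phase, e.g. a
    closure around the engine's prefill or decode call.
    """
    start = time.perf_counter()
    run_fn()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Illustrative use: a stand-in workload instead of a real model call.
prefill_tps = measure_throughput(lambda: time.sleep(0.01), num_tokens=1024)
decode_tps = measure_throughput(lambda: time.sleep(0.01), num_tokens=256)
```

Measuring the two phases separately matters because prefill is compute-bound and parallel while decode is sequential, which is why the table shows very different numbers for each.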

Note: The first time a given model is loaded on a given device, it will take longer to load as weights are optimized. Subsequent loads will be faster due to caching.

Model Hosting and Deployment

When a model exceeds the "over-the-air" download limits (often around 1.5GB), a remote fetch strategy is required.

  • Firebase: Recommended for downloading large files on Android and iOS.
  • HuggingFace API: Fetch models directly using the HuggingFace API.
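
For multi-gigabyte model files, a fetch strategy should survive interrupted connections. One common approach, sketched below with only the Python standard library, is to resume a partial download with an HTTP Range request; the URL, destination path, and helper names are placeholders, not part of LiteRT-LM or either hosting service's SDK.

```python
import os
import urllib.request

def range_header(offset: int) -> dict:
    """Resume from byte `offset`; an empty header means start from scratch."""
    return {"Range": f"bytes={offset}-"} if offset else {}

def resume_download(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Download a large model file, resuming from a partial file if present.

    If `dest` already holds N bytes from an earlier attempt, request only
    bytes N onward so a multi-GB download never restarts from zero.
    Assumes the server supports Range requests.
    """
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url, headers=range_header(offset))
    with urllib.request.urlopen(req) as resp, open(dest, "ab") as out:
        while chunk := resp.read(chunk_size):
            out.write(chunk)
```

Firebase and the HuggingFace Hub clients handle resumption for you; this sketch only shows what such a client does under the hood.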

Reporting Issues

If you encounter a bug or have a feature request, please use the LiteRT-LM GitHub Issues page.