LiteRT overview

LiteRT is Google's on-device framework for deploying high-performance ML and GenAI models on edge platforms, providing efficient model conversion, a lightweight runtime, and optimization tooling.

LiteRT 2.x introduces the latest runtime API, the CompiledModel API, which improves on the Interpreter API (also known as the TensorFlow Lite API), most notably in hardware acceleration and overall performance for on-device ML and AI applications. The CompiledModel API is in beta and is available in both Kotlin and C++.

Key LiteRT features

Supported Models

LiteRT supports streamlined conversion from PyTorch, TensorFlow, and JAX models to the .tflite or .litertlm format.

Streamline development with LiteRT

Automated accelerator selection instead of explicit delegate creation, simple NPU runtime and model distribution, and efficient I/O buffer handling with asynchronous execution for high performance.

Best-in-class GPU performance

Powered by ML Drift, LiteRT's GPU acceleration now supports both classical ML and generative AI models across GPU APIs (OpenCL, WebGPU, and Metal).

Unified NPU acceleration

Accelerate your model using simplified NPU access from major chipset providers.

Simplified LLM Support with LiteRT

You can use LiteRT-LM, which is built on LiteRT. LiteRT-LM supports the latest LLMs along with useful features such as multi-modality and constrained decoding.

Development workflow

LiteRT runs inference completely on-device on Android, iOS, Web, IoT, and desktop/laptop platforms. Regardless of the device, the most common workflow is as follows, with links to further instructions.

LiteRT development workflow graph

Identify the most suitable solution to the ML challenge

LiteRT offers users a high level of flexibility and customizability when it comes to solving machine learning problems, making it a good fit for users who require a specific model or a specialized implementation. Users looking for plug-and-play solutions may prefer MediaPipe Tasks, which provides ready-made solutions for common machine learning tasks like object detection, text classification, and LLM inference.

Obtain and prepare the model

A LiteRT model is represented in an efficient portable format known as FlatBuffers, which uses the .tflite file extension.

You can obtain a LiteRT model in the following ways:

  • Obtain a pre-trained model: for popular ML workloads such as image segmentation, object detection, etc.

    The simplest approach is to use a LiteRT model already in the .tflite format. These models don't require any additional conversion steps.

    Model Type                        Pre-trained Model Source
    Classical ML (.tflite format)     Kaggle or Hugging Face (e.g., image segmentation models and sample app)
    Generative AI (.litertlm format)  LiteRT Hugging Face page (e.g., the Gemma family)
  • Convert your chosen PyTorch, TensorFlow, or JAX model into a LiteRT model if you choose not to use a pre-trained model (see the conversion sketch after this list). [PRO USER]

    Model Framework   Sample Models                 Conversion Tool
    PyTorch           Hugging Face, Torchvision     Link
    TensorFlow        Kaggle Models, Hugging Face   Link
    JAX               Hugging Face                  Link
  • Author your LLM for further optimization using the Generative API. [PRO USER]

    The Generative API library provides built-in PyTorch building blocks for composing Transformer models such as Gemma, TinyLlama, and others using mobile-friendly abstractions, which guarantee conversion and performant execution on the LiteRT mobile runtime. See the Generative API documentation.
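
As a concrete example of the conversion step above, here is a minimal sketch of converting a PyTorch model with the ai-edge-torch converter; the torchvision model, sample input shape, and output file name are illustrative choices, not requirements.

# Sketch: convert a PyTorch model to a LiteRT (.tflite) model with ai-edge-torch.
# The MobileNetV2 model, input shape, and output path below are placeholders.
import ai_edge_torch
import torch
import torchvision

# Any torch.nn.Module in eval mode can be converted; MobileNetV2 is one example.
model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()

# Sample inputs define the input signature captured during conversion.
sample_inputs = (torch.randn(1, 3, 224, 224),)

# Convert and export to the FlatBuffers-based .tflite format.
edge_model = ai_edge_torch.convert(model, sample_inputs)
edge_model.export("mobilenet_v2.tflite")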

Quantization [PRO USER]

The AI Edge Quantizer is a tool for advanced developers that quantizes converted LiteRT models. It helps advanced users get optimal performance out of resource-demanding models (e.g., GenAI models).

For both classical machine learning models and generative AI models, see the AI Edge Quantizer documentation.
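
Below is a minimal, hedged sketch of that quantization workflow in Python; the recipe helper, file names, and export call are assumptions drawn from the AI Edge Quantizer's published examples, so verify them against the current documentation before use.

# Sketch: post-training quantization of a converted LiteRT model with the
# AI Edge Quantizer. The recipe helper and paths are assumptions based on the
# tool's published examples; check the current documentation for exact names.
from ai_edge_quantizer import quantizer
from ai_edge_quantizer import recipe

# Load the float .tflite model produced by conversion.
qt = quantizer.Quantizer("mobilenet_v2.tflite")

# Apply a predefined recipe (dynamic int8 weights, float32 activations).
qt.load_quantization_recipe(recipe.dynamic_wi8_afp32())

# Quantize and export the smaller, faster model.
quant_result = qt.quantize()
quant_result.export_model("mobilenet_v2_int8.tflite")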

Prerequisites for getting started

  • Python 3.10-3.12
  • .tflite model file
  • Additional details are available in the respective platform sections

Integrate the model into your app on edge platforms

You can deploy your LiteRT models to run inference completely on-device on Android, iOS, Web, IoT, and desktop devices. LiteRT provides APIs for Python, Java and Kotlin for Android, Swift for iOS, and C++ for micro-devices.
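
For example, here is a minimal sketch of running inference from Python using the classic Interpreter API in the ai-edge-litert package (which remains available alongside the CompiledModel API); the model path and zero-filled input are placeholders.

# Sketch: run inference on a .tflite model from Python via the ai-edge-litert
# package. The model path and zero-filled input are placeholders.
import numpy as np
from ai_edge_litert.interpreter import Interpreter

# Load the model and allocate its tensors.
interpreter = Interpreter(model_path="mymodel.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Fill the first input with data matching its declared shape and dtype.
input_data = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], input_data)

# Run the model and read the first output tensor.
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]["index"])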

Use the following guides to implement a LiteRT model on your preferred platform:

Guide                         Target Devices                            APIs
Run on Android                Android mobile devices                    C++/Kotlin APIs
Run on iOS                    iOS mobile devices                        C++/Swift* APIs
Run on Web using LiteRT.js    Devices with Chrome, Firefox, or Safari   JavaScript APIs
Run on Micro                  Embedded devices                          C++ APIs

*Coming soon

Kotlin code example

// Load model and initialize runtime
val compiledModel = CompiledModel.create("/path/to/mymodel.tflite", CompiledModel.Options(Accelerator.CPU))

// Prepare I/O buffers and fill in the data
val inputBuffers = compiledModel.createInputBuffers()
inputBuffers.get(0).writeFloat(input0)
inputBuffers.get(1).writeFloat(input1)

val outputBuffers = compiledModel.createOutputBuffers()

// Execute model
compiledModel.run(inputBuffers, outputBuffers)

// Access model output
val output = outputBuffers.get(0).readFloat()

inputBuffers.forEach { it.close() }
outputBuffers.forEach { it.close() }
compiledModel.close()

C++ code example

// Set up the environment and compile the model
LITERT_ASSIGN_OR_RETURN(auto env, GetEnvironment());
LITERT_ASSIGN_OR_RETURN(auto options, GetOptions());
LITERT_ASSIGN_OR_RETURN(
    auto compiled_model,
    CompiledModel::Create(env, "/path/to/mymodel.tflite", options));

// Prepare I/O buffers and fill in the input data
LITERT_ASSIGN_OR_RETURN(auto input_buffers,
                        compiled_model.CreateInputBuffers(signature_index));
LITERT_ASSIGN_OR_RETURN(auto output_buffers,
                        compiled_model.CreateOutputBuffers(signature_index));
LITERT_ABORT_IF_ERROR(input_buffers[0].Write(input0));
LITERT_ABORT_IF_ERROR(input_buffers[1].Write(input1));

// Execute the model
LITERT_ABORT_IF_ERROR(compiled_model.Run(signature_index, input_buffers, output_buffers));

// Access the model output
LITERT_ABORT_IF_ERROR(output_buffers[0].Read(output0));

Choose a backend

The most straightforward way to use hardware backends in LiteRT is to rely on the runtime's built-in selection logic. With the latest changes, LiteRT simplifies setup significantly by letting you specify the target backend as an option.

The core of LiteRT v2.x is the CompiledModel object. When you load a model, LiteRT uses the system's available hardware and internal priority logic to select the optimal backend without manual configuration. See the backend documentation for more details.

        Android              Desktop          Web       iOS             macOS           IoT
CPU     XNNPack              XNNPack          XNNPack   XNNPack         XNNPack         XNNPack
GPU     WebGPU, OpenCL       WebGPU, OpenCL   WebGPU    WebGPU, Metal   WebGPU, Metal   WebGPU
NPU     MediaTek, Qualcomm   -                -         -               -               -

Additional documentation and support

  • LiteRT v2.x sample app: see the LiteRT Image Segmentation sample app
  • Existing users of TensorFlow Lite: see the migration guide
  • Ops coverage: compatible operators
  • Supported LLM models: LiteRT Hugging Face page and Generative API samples
  • Tools: LiteRT tools page (performance, profiling, error reporting, etc.)