LiteRT provides a unified interface to use Neural Processing Units (NPUs) without forcing you to navigate vendor-specific compilers, runtimes, or library dependencies. Using LiteRT for NPU acceleration boosts performance for real-time and large-model inference and minimizes memory copies through zero-copy hardware buffer usage.
Get Started
To get started, see the NPU overview guide:
- For classical ML models, see the following sections for conversion, compilation, and deployment steps.
- For Large Language Models (LLMs), use our LiteRT-LM framework.
For example implementations of LiteRT with NPU support, refer to the LiteRT demo applications.
NPU Vendors
LiteRT supports NPU acceleration with the following vendors:
Qualcomm AI Engine Direct
- AOT and on-device (JIT) compilation execution paths are supported through the Compiled Model API.
- See Qualcomm AI Engine Direct for setup details.
MediaTek NeuroPilot
- AOT and JIT execution paths are supported through the Compiled Model API.
- See MediaTek NeuroPilot for setup details.
Convert and compile models for NPU
In order to use NPU acceleration with LiteRT, models must be converted to the LiteRT file format and compiled for on-device NPU usage. You can use the LiteRT AOT (ahead-of-time) compiler to compile models into an AI Pack, which bundles your compiled models with device-targeting configurations. This ensures that models are correctly served to devices based on the particular SoCs they are equipped with and optimized for.
After converting and compiling the models, you can use Play for On-device AI (PODAI) to upload them to Google Play and deliver them to devices through on-demand AI Pack delivery.
Use the LiteRT AOT compilation notebook for an end-to-end guide to converting and compiling models for NPU.
[AOT only] Deploy with Play AI Pack
After converting the model and compiling an AI Pack, use the following steps to deploy the AI Pack with Google Play.
Import AI Packs into the Gradle project
Copy the AI Pack(s) to the root directory of the Gradle project. For example:
my_app/
...
ai_packs/
my_model/...
my_model_mtk/...
Add each AI Pack to the Gradle build config:
// my_app/ai_packs/my_model/build.gradle.kts
plugins { id("com.android.ai-pack") }
aiPack {
packName = "my_model" // ai pack dir name
dynamicDelivery { deliveryType = "on-demand" }
}
// Add another build.gradle.kts for my_model_mtk/ as well
Add NPU runtime libraries to the project
Download litert_npu_runtime_libraries.zip for AOT or litert_npu_runtime_libraries_jit.zip for JIT, and unpack it in the root directory of the project:
my_app/
...
litert_npu_runtime_libraries/
mediatek_runtime/...
qualcomm_runtime_v69/...
qualcomm_runtime_v73/...
qualcomm_runtime_v75/...
qualcomm_runtime_v79/...
qualcomm_runtime_v81/...
fetch_qualcomm_library.sh
Run the script to download the NPU support libraries. For example, run the following for Qualcomm NPUs:
$ ./litert_npu_runtime_libraries/fetch_qualcomm_library.sh
Add AI Packs and NPU runtime libraries to the Gradle config
Copy device_targeting_configuration.xml from the generated AI Packs to the main app module's directory. Then update settings.gradle.kts:
// my_app/settings.gradle.kts
...
// [AOT only]
// AI Packs
include(":ai_packs:my_model")
include(":ai_packs:my_model_mtk")
// NPU runtime libraries
include(":litert_npu_runtime_libraries:runtime_strings")
include(":litert_npu_runtime_libraries:mediatek_runtime")
include(":litert_npu_runtime_libraries:qualcomm_runtime_v69")
include(":litert_npu_runtime_libraries:qualcomm_runtime_v73")
include(":litert_npu_runtime_libraries:qualcomm_runtime_v75")
include(":litert_npu_runtime_libraries:qualcomm_runtime_v79")
include(":litert_npu_runtime_libraries:qualcomm_runtime_v81")
Update build.gradle.kts:
// my_app/build.gradle.kts
android {
...
defaultConfig {
...
// API level 31+ is required for NPU support.
minSdk = 31
// NPU only supports arm64-v8a
ndk { abiFilters.add("arm64-v8a") }
// Needed for Qualcomm NPU runtime libraries
packaging { jniLibs { useLegacyPackaging = true } }
}
// Device targeting
bundle {
deviceTargetingConfig = file("device_targeting_configuration.xml")
deviceGroup {
enableSplit = true // split the bundle by device group
defaultGroup = "other" // group used for standalone APKs
}
}
// [AOT Only]
// AI Packs
assetPacks.add(":ai_packs:my_model")
assetPacks.add(":ai_packs:my_model_mtk")
// NPU runtime libraries
dynamicFeatures.add(":litert_npu_runtime_libraries:mediatek_runtime")
dynamicFeatures.add(":litert_npu_runtime_libraries:qualcomm_runtime_v69")
dynamicFeatures.add(":litert_npu_runtime_libraries:qualcomm_runtime_v73")
dynamicFeatures.add(":litert_npu_runtime_libraries:qualcomm_runtime_v75")
dynamicFeatures.add(":litert_npu_runtime_libraries:qualcomm_runtime_v79")
dynamicFeatures.add(":litert_npu_runtime_libraries:qualcomm_runtime_v81")
}
dependencies {
// Dependencies for strings used in the runtime library modules.
implementation(project(":litert_npu_runtime_libraries:runtime_strings"))
...
}
[AOT only] Use on-demand deployment
With the Android AI Pack feature configured in the build.gradle.kts file, check the device capabilities and use the NPU on capable devices, falling back to GPU and CPU otherwise:
val env = Environment.create(BuiltinNpuAcceleratorProvider(context))

// Qualcomm model: run on the NPU when supported, otherwise fall back to CPU/GPU.
val modelProvider = AiPackModelProvider(
    context, "my_model", "model/my_model.tflite") {
  if (NpuCompatibilityChecker.Qualcomm.isDeviceSupported())
    setOf(Accelerator.NPU) else setOf(Accelerator.CPU, Accelerator.GPU)
}

// MediaTek model: only eligible when a supported MediaTek NPU is present.
val mtkModelProvider = AiPackModelProvider(
    context, "my_model_mtk", "model/my_model_mtk.tflite") {
  if (NpuCompatibilityChecker.Mediatek.isDeviceSupported())
    setOf(Accelerator.NPU) else setOf()
}

// Select the model to use on this device and compile it with the
// accelerators it is compatible with.
val modelSelector = ModelSelector(modelProvider, mtkModelProvider)
val model = modelSelector.selectModel(env)
val compiledModel = CompiledModel.create(
    model.getPath(),
    CompiledModel.Options(model.getCompatibleAccelerators()),
    env,
)
Create CompiledModel for JIT mode
val env = Environment.create(BuiltinNpuAcceleratorProvider(context))
val compiledModel = CompiledModel.create(
"model/my_model.tflite",
CompiledModel.Options(Accelerator.NPU),
env,
)
Inference on NPU using LiteRT in Kotlin
To get started using the NPU accelerator, pass Accelerator.NPU when creating the CompiledModel, as shown in the previous sections.
The following code snippet shows the remaining inference steps in Kotlin, using the compiledModel created above:
// Allocate input and output buffers for the compiled model.
val inputBuffers = compiledModel.createInputBuffers()
val outputBuffers = compiledModel.createOutputBuffers()
// Fill the first input buffer and run inference.
inputBuffers[0].writeFloat(FloatArray(data_size) { data_value })
compiledModel.run(inputBuffers, outputBuffers)
// Read back the results.
val outputFloatArray = outputBuffers[0].readFloat()
// Release the buffers and the model when done.
inputBuffers.forEach { it.close() }
outputBuffers.forEach { it.close() }
compiledModel.close()
Inference on NPU using LiteRT in C++
Build dependencies
C++ users must build the application's dependencies with LiteRT NPU acceleration. The cc_binary rule that packages the core application logic (e.g., main.cc) requires the following runtime components:
- LiteRT C API shared library: the data attribute must include the LiteRT C API shared library (//litert/c:litert_runtime_c_api_shared_lib) and the vendor-specific dispatch shared object for the NPU (//litert/vendors/qualcomm/dispatch:dispatch_api_so).
- NPU-specific backend libraries: for example, the Qualcomm AI RT (QAIRT) libraries for the Android host (such as libQnnHtp.so and libQnnHtpPrepare.so) and the corresponding Hexagon DSP library (libQnnHtpV79Skel.so). These ensure that the LiteRT runtime can offload computations to the NPU.
- Attribute dependencies: the deps attribute links against essential compile-time dependencies, such as LiteRT's tensor buffer (//litert/cc:litert_tensor_buffer) and the API for the NPU dispatch layer (//litert/vendors/qualcomm/dispatch:dispatch_api). This enables your application code to interact with the NPU through LiteRT.
- Model files and other assets: included through the data attribute.
This setup allows your compiled binary to dynamically load and use the NPU for accelerated machine learning inference.
Set up an NPU environment
Some NPU backends require runtime dependencies or libraries. When using the Compiled Model API, LiteRT organizes these requirements through an Environment object. Use the following code to tell LiteRT where to find the appropriate NPU libraries or drivers:
// Provide a dispatch library directory (following is a hypothetical path) for the NPU
std::vector<Environment::Option> environment_options = {
{
Environment::OptionTag::DispatchLibraryDir,
"/usr/lib64/npu_dispatch/"
}
};
LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create(absl::MakeConstSpan(environment_options)));
Runtime integration
The following code snippet shows a basic implementation of the entire process in C++:
// 1. Load the model that has NPU-compatible ops
LITERT_ASSIGN_OR_RETURN(auto model, Model::Load("mymodel_npu.tflite"));
// 2. Create a compiled model with NPU acceleration
// See the previous section on how to set up the NPU environment
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
CompiledModel::Create(env, model, kLiteRtHwAcceleratorNpu));
// 3. Allocate I/O buffers
LITERT_ASSIGN_OR_RETURN(auto input_buffers, compiled_model.CreateInputBuffers());
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());
// 4. Fill model inputs (CPU array -> NPU buffers)
float input_data[] = { /* your input data */ };
input_buffers[0].Write<float>(absl::MakeConstSpan(input_data));
// 5. Run inference
compiled_model.Run(input_buffers, output_buffers);
// 6. Access model output
std::vector<float> data(output_data_size);
output_buffers[0].Read<float>(absl::MakeSpan(data));
Zero-copy with NPU acceleration
Using zero-copy enables an NPU to access data directly in its own memory without the need for the CPU to explicitly copy that data. By not copying data to and from CPU memory, zero-copy can significantly reduce end-to-end latency.
The following code is an example implementation of Zero-Copy NPU with
AHardwareBuffer, passing data directly to the NPU. This implementation avoids
expensive round-trips to CPU memory, significantly reducing inference overhead.
// Suppose you have AHardwareBuffer* ahw_buffer
LITERT_ASSIGN_OR_RETURN(auto tensor_type, model.GetInputTensorType("input_tensor"));
LITERT_ASSIGN_OR_RETURN(auto npu_input_buffer, TensorBuffer::CreateFromAhwb(
env,
tensor_type,
ahw_buffer,
/* offset = */ 0
));
std::vector<TensorBuffer> input_buffers{npu_input_buffer};
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());
// Execute the model
compiled_model.Run(input_buffers, output_buffers);
// Retrieve the output (possibly also an AHWB or other specialized buffer)
auto ahwb_output = output_buffers[0].GetAhwb();
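The snippet above assumes you already have an AHardwareBuffer, for example one produced by a camera or GPU pipeline. If you need to allocate one yourself, the following is a minimal sketch using the Android NDK, where size_in_bytes is a placeholder that must match the input tensor's byte size and the usage flags may need to be extended for your NPU vendor:
#include <android/hardware_buffer.h>

// Describe a BLOB buffer large enough to hold the input tensor's raw bytes.
// BLOB buffers must use height = 1 and layers = 1.
AHardwareBuffer_Desc desc = {};
desc.width = size_in_bytes;  // Placeholder: total byte size of the input tensor.
desc.height = 1;
desc.layers = 1;
desc.format = AHARDWAREBUFFER_FORMAT_BLOB;
desc.usage = AHARDWAREBUFFER_USAGE_CPU_WRITE_OFTEN |
             AHARDWAREBUFFER_USAGE_CPU_READ_OFTEN;

AHardwareBuffer* ahw_buffer = nullptr;
if (AHardwareBuffer_allocate(&desc, &ahw_buffer) != 0) {
  // Allocation failed; fall back to regular (non-zero-copy) buffers.
}
// Fill the buffer via AHardwareBuffer_lock()/AHardwareBuffer_unlock() before
// passing it to TensorBuffer::CreateFromAhwb() as shown above.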
Chain multiple NPU inferences
For complex pipelines, you can chain multiple NPU inferences. Since each step uses an accelerator-friendly buffer, your pipeline stays mostly in NPU-managed memory:
// compiled_model1 outputs into an AHWB
compiled_model1.Run(input_buffers, intermediate_buffers);
// compiled_model2 consumes that same AHWB
compiled_model2.Run(intermediate_buffers, final_outputs);
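For instance, the following is a minimal sketch of this two-stage pattern. It assumes compiled_model1 and compiled_model2 were created with NPU acceleration as in the previous sections, and that the output signature of model 1 matches the input signature of model 2:
// Allocate the buffer sets once and reuse them across inferences.
LITERT_ASSIGN_OR_RETURN(auto input_buffers, compiled_model1.CreateInputBuffers());
// Produced by model 1 and consumed directly by model 2, so intermediate
// activations never round-trip through CPU memory.
LITERT_ASSIGN_OR_RETURN(auto intermediate_buffers, compiled_model1.CreateOutputBuffers());
LITERT_ASSIGN_OR_RETURN(auto final_outputs, compiled_model2.CreateOutputBuffers());

// Stage 1: input -> intermediate activations.
compiled_model1.Run(input_buffers, intermediate_buffers);
// Stage 2: intermediate activations -> final outputs.
compiled_model2.Run(intermediate_buffers, final_outputs);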
NPU just-in-time compilation caching
LiteRT supports NPU just-in-time (JIT) compilation of .tflite models. JIT
compilation can be especially useful in situations where compiling the model
ahead of time is not feasible.
However, JIT compilation can come with some latency and memory overhead, since the user-provided model is translated into NPU bytecode instructions on demand. To minimize the performance impact, NPU compilation artifacts can be cached.
When caching is enabled, LiteRT only triggers recompilation of the model when required, for example when:
- The vendor's NPU compiler plugin version changed;
- The Android build fingerprint changed;
- The user-provided model changed;
- The compilation options changed.
In order to enable NPU compilation caching, specify the CompilerCacheDir
environment tag in the environment options. The value must be set to an
existing writable path of the application.
const std::array environment_options = {
litert::Environment::Option{
/*.tag=*/litert::Environment::OptionTag::CompilerPluginLibraryDir,
/*.value=*/kCompilerPluginLibSearchPath,
},
litert::Environment::Option{
litert::Environment::OptionTag::DispatchLibraryDir,
kDispatchLibraryDir,
},
// 'kCompilerCacheDir' will be used to store NPU-compiled model
// artifacts.
litert::Environment::Option{
litert::Environment::OptionTag::CompilerCacheDir,
kCompilerCacheDir,
},
};
// Create an environment.
LITERT_ASSERT_OK_AND_ASSIGN(
auto environment, litert::Environment::Create(environment_options));
// Load a model.
auto model_path = litert::testing::GetTestFilePath(kModelFileName);
LITERT_ASSERT_OK_AND_ASSIGN(auto model,
litert::Model::CreateFromFile(model_path));
// Create a compiled model, which only triggers NPU compilation if
// required.
LITERT_ASSERT_OK_AND_ASSIGN(
auto compiled_model, litert::CompiledModel::Create(
environment, model, kLiteRtHwAcceleratorNpu));
Example latency and memory savings:
The time and memory required for NPU compilation vary based on several factors, such as the underlying NPU chip and the complexity of the input model.
The following tables compare the runtime initialization time and memory consumption when NPU compilation is required versus when compilation can be skipped due to caching. On one sample device, we obtained the following results:
| TFLite model | Init time with NPU compilation | Init time with cached compilation | Init memory with NPU compilation | Init memory with cached compilation |
|---|---|---|---|---|
| torchvision_resnet152.tflite | 7465.22 ms | 198.34 ms | 1525.24 MB | 355.07 MB |
| torchvision_lraspp_mobilenet_v3_large.tflite | 1592.54 ms | 166.47 ms | 254.90 MB | 33.78 MB |
On another sample device, we obtained the following results:
| TFLite model | Init time with NPU compilation | Init time with cached compilation | Init memory with NPU compilation | Init memory with cached compilation |
|---|---|---|---|---|
| torchvision_resnet152.tflite | 2766.44 ms | 379.86 ms | 653.54 MB | 501.21 MB |
| torchvision_lraspp_mobilenet_v3_large.tflite | 784.14 ms | 231.76 ms | 113.14 MB | 67.49 MB |