NPU acceleration with LiteRT

LiteRT provides a unified interface to use Neural Processing Units (NPUs) without forcing you to navigate vendor-specific compilers, runtimes, or library dependencies. Using LiteRT for NPU acceleration boosts performance for real-time and large-model inference and minimizes memory copies through zero-copy hardware buffer usage.

Get Started

To get started, see the NPU overview guide.

For example implementations of LiteRT with NPU support, refer to the demo applications.

NPU Vendors

LiteRT supports NPU acceleration with the following vendors:

Qualcomm AI Engine Direct

  • AOT and on-device compilation execution paths are supported through the Compiled Model API.
  • See Qualcomm AI Engine Direct for setup details.

MediaTek NeuroPilot

  • AOT and JIT execution paths are supported through the Compiled Model API.
  • See MediaTek NeuroPilot for setup details.

Convert and compile models for NPU

To use NPU acceleration with LiteRT, models must be converted to the LiteRT file format and compiled for on-device NPU execution. You can use the LiteRT AOT (ahead-of-time) compiler to compile models into an AI Pack, which bundles your compiled models with device-targeting configurations. This ensures that models are served only to devices that are equipped with, or optimized for, the targeted SoCs.

After converting and compiling the models, you can use Play for On-device AI (PODAI) to upload models to Google Play and deliver them to devices through the On-Demand AI framework.

Use the LiteRT AOT compilation notebook for an end-to-end guide to converting and compiling models for NPU.

[AOT only] Deploy with Play AI Pack

After converting the model and compiling an AI Pack, use the following steps to deploy the AI Pack with Google Play.

Import AI Packs into the Gradle project

Copy the AI Pack(s) to the root directory of the Gradle project. For example:

my_app/
    ...
    ai_packs/
        my_model/...
        my_model_mtk/...

Add each AI Pack to the Gradle build config:

// my_app/ai_packs/my_model/build.gradle.kts

plugins { id("com.android.ai-pack") }

aiPack {
  packName = "my_model"  // ai pack dir name
  dynamicDelivery { deliveryType = "on-demand" }
}

// Add another build.gradle.kts for my_model_mtk/ as well

Add NPU runtime libraries to the project

Download litert_npu_runtime_libraries.zip for AOT or litert_npu_runtime_libraries_jit.zip for JIT, and unpack it in the root directory of the project:

my_app/
    ...
    litert_npu_runtime_libraries/
        mediatek_runtime/...
        qualcomm_runtime_v69/...
        qualcomm_runtime_v73/...
        qualcomm_runtime_v75/...
        qualcomm_runtime_v79/...
        qualcomm_runtime_v81/...
        fetch_qualcomm_library.sh

Run the script to download the NPU support libraries. For example, run the following for Qualcomm NPUs:

$ ./litert_npu_runtime_libraries/fetch_qualcomm_library.sh

Add AI Packs and NPU runtime libraries to the Gradle config

Copy device_targeting_configuration.xml from the generated AI Packs to the directory of the main app module. Then update settings.gradle.kts:

// my_app/settings.gradle.kts

...

// [AOT only]
// AI Packs
include(":ai_packs:my_model")
include(":ai_packs:my_model_mtk")

// NPU runtime libraries
include(":litert_npu_runtime_libraries:runtime_strings")

include(":litert_npu_runtime_libraries:mediatek_runtime")
include(":litert_npu_runtime_libraries:qualcomm_runtime_v69")
include(":litert_npu_runtime_libraries:qualcomm_runtime_v73")
include(":litert_npu_runtime_libraries:qualcomm_runtime_v75")
include(":litert_npu_runtime_libraries:qualcomm_runtime_v79")
include(":litert_npu_runtime_libraries:qualcomm_runtime_v81")

Update build.gradle.kts:

// my_app/build.gradle.kts

android {
  ...

  defaultConfig {
    ...

    // API level 31+ is required for NPU support.
    minSdk = 31

    // NPU only supports arm64-v8a
    ndk { abiFilters.add("arm64-v8a") }
    // Needed for Qualcomm NPU runtime libraries
    packaging { jniLibs { useLegacyPackaging = true } }
  }

  // Device targeting
  bundle {
    deviceTargetingConfig = file("device_targeting_configuration.xml")
    deviceGroup {
      enableSplit = true // split the bundle by device group
      defaultGroup = "other" // group used for standalone APKs
    }
  }

  // [AOT Only]
  // AI Packs
  assetPacks.add(":ai_packs:my_model")
  assetPacks.add(":ai_packs:my_model_mtk")

  // NPU runtime libraries
  dynamicFeatures.add(":litert_npu_runtime_libraries:mediatek_runtime")
  dynamicFeatures.add(":litert_npu_runtime_libraries:qualcomm_runtime_v69")
  dynamicFeatures.add(":litert_npu_runtime_libraries:qualcomm_runtime_v73")
  dynamicFeatures.add(":litert_npu_runtime_libraries:qualcomm_runtime_v75")
  dynamicFeatures.add(":litert_npu_runtime_libraries:qualcomm_runtime_v79")
  dynamicFeatures.add(":litert_npu_runtime_libraries:qualcomm_runtime_v81")
}

dependencies {
  // Dependencies for strings used in the runtime library modules.
  implementation(project(":litert_npu_runtime_libraries:runtime_strings"))
  ...
}

[AOT only] Use on-demand deployment

With the Android AI Pack feature configured in the build.gradle.kts file, check the device capabilities and use the NPU on capable devices, falling back to GPU and CPU otherwise:

val env = Environment.create(BuiltinNpuAcceleratorProvider(context))

val modelProvider = AiPackModelProvider(
    context, "my_model", "model/my_model.tflite") {
    if (NpuCompatibilityChecker.Qualcomm.isDeviceSupported())
      setOf(Accelerator.NPU) else setOf(Accelerator.CPU, Accelerator.GPU)
}
val mtkModelProvider = AiPackModelProvider(
    context, "my_model_mtk", "model/my_model_mtk.tflite") {
    if (NpuCompatibilityChecker.Mediatek.isDeviceSupported())
      setOf(Accelerator.NPU) else setOf()
}
val modelSelector = ModelSelector(modelProvider, mtkModelProvider)
val model = modelSelector.selectModel(env)

val compiledModel = CompiledModel.create(
    model.getPath(),
    CompiledModel.Options(model.getCompatibleAccelerators()),
    env,
)

Create CompiledModel for JIT mode

In JIT mode, the .tflite model is compiled for the NPU on the device at runtime:

val env = Environment.create(BuiltinNpuAcceleratorProvider(context))

val compiledModel = CompiledModel.create(
    "model/my_model.tflite",
    CompiledModel.Options(Accelerator.NPU),
    env,
)

Inference on NPU using LiteRT in Kotlin

To use the NPU accelerator, pass the NPU accelerator option (Accelerator.NPU) when creating the compiled model (CompiledModel).

The following code snippet shows a basic implementation of the entire process in Kotlin:

val env = Environment.create(BuiltinNpuAcceleratorProvider(context))
val model = CompiledModel.create(
    "model/my_model.tflite",
    CompiledModel.Options(Accelerator.NPU),
    env,
)

val inputBuffers = model.createInputBuffers()
val outputBuffers = model.createOutputBuffers()

inputBuffers[0].writeFloat(FloatArray(data_size) { data_value })
model.run(inputBuffers, outputBuffers)
val outputFloatArray = outputBuffers[0].readFloat()

inputBuffers.forEach { it.close() }
outputBuffers.forEach { it.close() }
model.close()

Inference on NPU using LiteRT in C++

Build dependencies

C++ applications must be built with the LiteRT NPU acceleration dependencies. The cc_binary rule that packages the core application logic (e.g., main.cc) requires the following runtime components:

  • LiteRT C API shared library: the data attribute must include the LiteRT C API shared library (//litert/c:litert_runtime_c_api_shared_lib) and the vendor-specific dispatch shared object for the NPU (//litert/vendors/qualcomm/dispatch:dispatch_api_so).
  • NPU-specific backend libraries: For example, the Qualcomm AI RT (QAIRT) libraries for the Android host (like libQnnHtp.so, libQnnHtpPrepare.so) and the corresponding Hexagon DSP library (libQnnHtpV79Skel.so). This ensures that the LiteRT runtime can offload computations to the NPU.
  • Compile-time dependencies: the deps attribute links against essential compile-time dependencies, such as LiteRT's tensor buffer (//litert/cc:litert_tensor_buffer) and the API for the NPU dispatch layer (//litert/vendors/qualcomm/dispatch:dispatch_api). This enables your application code to interact with the NPU through LiteRT.
  • Model files and other assets: Included through the data attribute.

This setup allows your compiled binary to dynamically load and use the NPU for accelerated machine learning inference.
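
As a rough sketch, the following hypothetical cc_binary target shows how these pieces could fit together. The //litert targets are the ones named above; the QAIRT library labels, the model filename, and the target name are placeholders that depend on how your workspace vendors the Qualcomm AI RT SDK:

# Hypothetical BUILD sketch; adjust target names and QAIRT paths to your workspace.
cc_binary(
    name = "npu_inference_app",
    srcs = ["main.cc"],
    data = [
        # LiteRT C API shared library and the vendor-specific dispatch object.
        "//litert/c:litert_runtime_c_api_shared_lib",
        "//litert/vendors/qualcomm/dispatch:dispatch_api_so",
        # QAIRT backend libraries for the Android host and the Hexagon DSP
        # (placeholder labels; point them at your QAIRT checkout).
        "@qairt//:lib/aarch64-android/libQnnHtp.so",
        "@qairt//:lib/aarch64-android/libQnnHtpPrepare.so",
        "@qairt//:lib/hexagon-v79/unsigned/libQnnHtpV79Skel.so",
        # Model files and other assets.
        "mymodel_npu.tflite",
    ],
    deps = [
        # Compile-time dependencies used by the application code.
        "//litert/cc:litert_tensor_buffer",
        "//litert/vendors/qualcomm/dispatch:dispatch_api",
    ],
)

At runtime, the directories where these shared objects are deployed are the paths you pass to the LiteRT Environment, as shown in the next section.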

Set up an NPU environment

Some NPU backends require runtime dependencies or libraries. When using the Compiled Model API, LiteRT organizes these requirements through an Environment object. Use the following code to tell LiteRT where to find the appropriate NPU libraries or drivers:

// Provide a dispatch library directory (following is a hypothetical path) for the NPU
std::vector<Environment::Option> environment_options = {
    {
      Environment::OptionTag::DispatchLibraryDir,
      "/usr/lib64/npu_dispatch/"
    }
};

LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create(absl::MakeConstSpan(environment_options)));

Runtime integration

The following code snippet shows a basic implementation of the entire process in C++:

// 1. Load the model that has NPU-compatible ops
LITERT_ASSIGN_OR_RETURN(auto model, Model::Load("mymodel_npu.tflite"));

// 2. Create a compiled model with NPU acceleration
//    See the previous section on how to set up the NPU environment
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
  CompiledModel::Create(env, model, kLiteRtHwAcceleratorNpu));

// 3. Allocate I/O buffers
LITERT_ASSIGN_OR_RETURN(auto input_buffers, compiled_model.CreateInputBuffers());
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

// 4. Fill model inputs (CPU array -> NPU buffers)
float input_data[] = { /* your input data */ };
input_buffers[0].Write<float>(absl::MakeConstSpan(input_data, /*size*/));

// 5. Run inference
compiled_model.Run(input_buffers, output_buffers);

// 6. Access model output
std::vector<float> data(output_data_size);
output_buffers[0].Read<float>(absl::MakeSpan(data));

Zero-copy with NPU acceleration

Using zero-copy enables an NPU to access data directly in its own memory without the need for the CPU to explicitly copy that data. By not copying data to and from CPU memory, zero-copy can significantly reduce end-to-end latency.

The following code is an example implementation of zero-copy NPU inference with AHardwareBuffer, passing data directly to the NPU. This implementation avoids expensive round trips through CPU memory, significantly reducing inference overhead.

// Suppose you have AHardwareBuffer* ahw_buffer

LITERT_ASSIGN_OR_RETURN(auto tensor_type, model.GetInputTensorType("input_tensor"));

LITERT_ASSIGN_OR_RETURN(auto npu_input_buffer, TensorBuffer::CreateFromAhwb(
    env,
    tensor_type,
    ahw_buffer,
    /* offset = */ 0
));

std::vector<TensorBuffer> input_buffers{npu_input_buffer};

LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

// Execute the model
compiled_model.Run(input_buffers, output_buffers);

// Retrieve the output (possibly also an AHWB or other specialized buffer)
auto ahwb_output = output_buffers[0].GetAhwb();

Chain multiple NPU inferences

For complex pipelines, you can chain multiple NPU inferences. Since each step uses an accelerator-friendly buffer, your pipeline stays mostly in NPU-managed memory:

// compiled_model1 outputs into an AHWB
compiled_model1.Run(input_buffers, intermediate_buffers);

// compiled_model2 consumes that same AHWB
compiled_model2.Run(intermediate_buffers, final_outputs);

NPU just-in-time compilation caching

LiteRT supports NPU just-in-time (JIT) compilation of .tflite models. JIT compilation can be especially useful in situations where compiling the model ahead of time is not feasible.

JIT compilation, however, can add latency and memory overhead, because the user-provided model is translated into NPU bytecode instructions on demand. To minimize the performance impact, NPU compilation artifacts can be cached.

When caching is enabled, LiteRT only triggers re-compilation of the model when required, for example:

  • The vendor's NPU compiler plugin version changed;
  • The Android build fingerprint changed;
  • The user-provided model changed;
  • The compilation options changed.

To enable NPU compilation caching, specify the CompilerCacheDir tag in the environment options. The value must be an existing, writable path belonging to the application.

const std::array environment_options = {
    litert::Environment::Option{
        /*.tag=*/litert::Environment::OptionTag::CompilerPluginLibraryDir,
        /*.value=*/kCompilerPluginLibSearchPath,
    },
    litert::Environment::Option{
        litert::Environment::OptionTag::DispatchLibraryDir,
        kDispatchLibraryDir,
    },
    // 'kCompilerCacheDir' will be used to store NPU-compiled model
    // artifacts.
    litert::Environment::Option{
        litert::Environment::OptionTag::CompilerCacheDir,
        kCompilerCacheDir,
    },
};

// Create an environment.
LITERT_ASSERT_OK_AND_ASSIGN(
    auto environment, litert::Environment::Create(environment_options));

// Load a model.
auto model_path = litert::testing::GetTestFilePath(kModelFileName);
LITERT_ASSERT_OK_AND_ASSIGN(auto model,
                            litert::Model::CreateFromFile(model_path));

// Create a compiled model, which only triggers NPU compilation if
// required.
LITERT_ASSERT_OK_AND_ASSIGN(
    auto compiled_model, litert::CompiledModel::Create(
                             environment, model, kLiteRtHwAcceleratorNpu));

Example latency and memory savings

The time and memory required for NPU compilation can vary based on several factors, such as the underlying NPU chip and the complexity of the input model.

The following tables compare the runtime initialization time and memory consumption when NPU compilation is required versus when compilation can be skipped due to caching. On one sample device, we obtained the following results:

TFLite model | Model init with NPU compilation | Model init with cached compilation | Init memory with NPU compilation | Init memory with cached compilation
torchvision_resnet152.tflite | 7465.22 ms | 198.34 ms | 1525.24 MB | 355.07 MB
torchvision_lraspp_mobilenet_v3_large.tflite | 1592.54 ms | 166.47 ms | 254.90 MB | 33.78 MB

On another sample device, we obtained the following results:

TFLite model | Model init with NPU compilation | Model init with cached compilation | Init memory with NPU compilation | Init memory with cached compilation
torchvision_resnet152.tflite | 2766.44 ms | 379.86 ms | 653.54 MB | 501.21 MB
torchvision_lraspp_mobilenet_v3_large.tflite | 784.14 ms | 231.76 ms | 113.14 MB | 67.49 MB