The LiteRT Torch Generative API is a high-performance library for authoring and converting transformer-based PyTorch models into the LiteRT/LiteRT-LM format. It lets developers deploy generative AI models, specifically Large Language Models (LLMs), for on-device text and image generation.
The Torch Generative API supports model conversion for CPU, GPU and NPU. By pairing the Torch Generative API with LiteRT-LM, you can build responsive, privacy-focused applications that run generative models entirely on-device.
Convert from Hugging Face Transformers Library
The LiteRT Torch Hugging Face Export extension provides a streamlined pathway to convert generative AI models directly from the Hugging Face Transformers Library into the LiteRT-LM format. Whereas the LiteRT Torch Generative API provides PyTorch building blocks for building and optimizing custom models, this tool handles downloading weights, translating PyTorch model architectures, and applying optimizations such as graph optimizations and quantization in a single workflow. It outputs a .litertlm file, which is optimized for on-device inference on CPU, GPU and NPU using the LiteRT-LM runtime.
Prerequisites
Before using the export extension, make sure you have the following set up:
- Install the LiteRT Torch Python package. The Hugging Face Export extension is built directly into the litert-torch package.
- (Optional) For NPU compilation, install the LiteRT NPU SDK extensions using pip install ai-edge-litert[npu-sdk]. For more details, see the LiteRT NPU AOT Compilation Colab.
- Set up your Hugging Face environment if you intend to load models from the Hugging Face Hub directly. The export_hf tool uses the standard transformers authentication mechanisms, such as HF_TOKEN or the CLI.
To download gated models (such as Gemma or Llama), you must authenticate with Hugging Face using either the CLI or an environment variable:
# Set your Hugging Face token as an environment variable
export HF_TOKEN="your_hugging_face_token"
# Or use the Hugging Face CLI login
hf auth login
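Before starting a long export of a gated model, it can help to confirm that a token is actually visible to the process. The following is a minimal sketch: the helper name find_hf_token is hypothetical, and it only checks the HF_TOKEN environment variable mentioned above, not credentials stored by hf auth login.

```python
import os


def find_hf_token(environ=None):
    """Return the HF_TOKEN value if it is set and non-empty, else None.

    Hypothetical helper for illustration only. It does not inspect
    credentials saved by `hf auth login`.
    """
    env = os.environ if environ is None else environ
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if __name__ == "__main__":
    if find_hf_token() is None:
        print("HF_TOKEN is not set; gated models may fail to download "
              "unless you have logged in with the CLI.")
    else:
        print("HF_TOKEN found in the environment.")
```

Passing a dictionary instead of relying on os.environ keeps the check easy to exercise in a script or notebook without changing the real environment.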
Basic Usage
You can run export_hf from the command line or through the Python API. The tool
automatically downloads the model from Hugging Face or loads it from the local
path you provide, traces it, applies default optimizations, and converts it to
a .litertlm file compatible with CPU and GPU inference.
Command Line Interface (CLI)
Use the litert-torch export_hf command. You need to provide the Hugging Face
model ID and the chosen output directory.
litert-torch export_hf \
--model=google/gemma-3-270m-it \
--output_dir=/tmp/gemma3-270m-it-litertlm
To export a local or custom model, pass the path to the directory containing the safetensors checkpoint instead:
litert-torch export_hf \
--model=/path/to/safetensor/dir \
--output_dir=/my_custom_litertlm
Python API
For integration into Python scripts or notebooks, import the export module
from litert_torch.generative.export_hf.
from litert_torch.generative.export_hf import export
export.export(
model='google/gemma-3-270m-it',
output_dir='/tmp/gemma3-270m-it-litertlm',
)
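When several models need to be exported from one script, the call above can be wrapped in a loop. This is a sketch under stated assumptions: output_dir_for and export_all are hypothetical helpers, the "-litertlm" directory naming is only a convention chosen for this example, and the only library call used is export.export(model=..., output_dir=...) as shown above.

```python
def output_dir_for(model_id, root="/tmp"):
    """Derive an output directory from a model ID or local path.

    Hypothetical naming scheme: the last path component plus "-litertlm".
    """
    name = model_id.rstrip("/").split("/")[-1]
    return f"{root.rstrip('/')}/{name}-litertlm"


def export_all(model_ids, root="/tmp"):
    """Export each model in turn using the export() call shown above."""
    # Imported lazily so output_dir_for() can be used without the package.
    from litert_torch.generative.export_hf import export

    for model_id in model_ids:
        export.export(model=model_id, output_dir=output_dir_for(model_id, root))


if __name__ == "__main__":
    # Only prints the derived directory; call export_all([...]) to export.
    print(output_dir_for("google/gemma-3-270m-it"))
```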
On-device deployment with LiteRT-LM
Once you have exported your model to a .litertlm file, you can
deploy it directly on-device using LiteRT-LM for high-performance execution on
both CPU and GPU. For details, see the LiteRT-LM API documentation. For
NPU acceleration, refer to the NPU AOT compilation guide.
Supported Architectures
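The export_hf tool has been verified with the following Transformers model architectures.
To check which architecture a model uses, inspect its config.json: the model_type field identifies the model family, and the class names shown below are listed under architectures.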
- Gemma 3 (Gemma3ForCausalLM)
- Gemma 3n (Gemma3nForCausalLM)
- Gemma 4 (Gemma4ForCausalLM)
- Llama (LlamaForCausalLM)
- Mistral (MistralForCausalLM)
- Qwen 2/2.5 (Qwen2ForCausalLM)
- Qwen 3 (Qwen3ForCausalLM)
- SmolLM 3 (SmolLM3ForCausalLM)
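For a local checkpoint, the check against the list above can be scripted. The sketch below is not part of export_hf: the helper name is hypothetical, and it assumes the class names are stored in the architectures list of config.json, which is where Transformers writes them when saving a model.

```python
import json
from pathlib import Path

# Class names copied from the list above.
VERIFIED_ARCHITECTURES = {
    "Gemma3ForCausalLM",
    "Gemma3nForCausalLM",
    "Gemma4ForCausalLM",
    "LlamaForCausalLM",
    "MistralForCausalLM",
    "Qwen2ForCausalLM",
    "Qwen3ForCausalLM",
    "SmolLM3ForCausalLM",
}


def verified_architectures_in(checkpoint_dir):
    """Return the architectures in config.json that appear in the list above."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: class names live under the "architectures" key.
    declared = config.get("architectures") or []
    return [name for name in declared if name in VERIFIED_ARCHITECTURES]
```

An empty result does not necessarily mean the export will fail; it only means the architecture is not one of those listed as verified.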
Advanced settings
The extension flags expose many advanced options; the following are some common settings you can try.
Vision Language Models
For supported models, you can set --task=image_text_to_text and
--export_vision_encoder to load and export the vision encoder model.
Supported architectures:
- Gemma 3 (Gemma3ForConditionalGeneration)
- Gemma 4 (Gemma4ForConditionalGeneration)
Quantization Configuration
Generative AI models are often too large to run efficiently on edge devices
without optimization. By default, export_hf applies the dynamic_wi8_afp32
quantization recipe using AI Edge Quantizer,
which quantizes weights to per-channel INT8 while keeping activations in FP32.
You can override this default behavior using the --quantization_recipe flag
(or the quantization_recipe parameter in Python).
You can provide the name of a built-in recipe from
AI Edge Quantizer or specify the path to a custom JSON
recipe.
Example:
litert-torch export_hf \
--model=google/gemma-3-270m-it \
--output_dir=/tmp/gemma3-270m-it-litertlm \
--quantization_recipe=/path/to/my/quantization_recipe.json
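To make "per-channel INT8 weights" concrete, the following pure-Python sketch quantizes each row of a small weight matrix with its own scale. It is only an illustration of symmetric per-channel quantization under the assumption of one row per output channel; it is not the AI Edge Quantizer implementation, and the function names are hypothetical.

```python
def quantize_per_channel_int8(weights):
    """Symmetric per-channel INT8 quantization of a 2-D weight matrix.

    `weights` is a list of rows, one row per output channel. Returns
    (quantized_rows, scales), where each row has its own scale.
    """
    quantized, scales = [], []
    for row in weights:
        max_abs = max((abs(w) for w in row), default=0.0)
        # One scale per channel; fall back to 1.0 for an all-zero row.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_row = [max(-127, min(127, round(w / scale))) for w in row]
        quantized.append(q_row)
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Recover approximate float weights: w is roughly q * scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


if __name__ == "__main__":
    w = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.04]]
    q, s = quantize_per_channel_int8(w)
    print(q)
    print(dequantize(q, s))
```

Because each channel gets its own scale, a row with small values (such as the second row in the usage example) keeps more relative precision than it would under a single scale shared by the whole matrix.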
Jinja Template Override
The Jinja chat template that ships with the Transformers model might not be
compatible with LiteRT-LM (for example, with Gemma 4 models). In that case, set the
use_jinja_template flag to False, or use the jinja_chat_template_override option to override the template.
Example:
litert-torch export_hf \
--model=google/gemma-4-E2B-it \
--output_dir=/tmp/gemma4_2b_litertlm \
--externalize_embedder \
--jinja_chat_template_override=litert-community/gemma-4-E2B-it-litert-lm
NPU AOT Compilation
In addition to CPU and GPU, you can target supported NPU accelerators when exporting your models by providing the NPU-specific options.
Google Tensor
Prerequisites: Follow the Google Tensor SDK page to set up the development environment.
To export LLMs targeting Google Tensor TPUs, add the TPU compilation flags shown in the following example.
Example:
litert-torch export_hf \
--model=google/gemma-3-270m-it \
--output_dir=/tmp/gemma3-270m-google-tensor-g5 \
--split_cache \
--externalize_embedder \
--prefill_lengths=128, \
--cache_length=1280 \
--quantization_recipe="weight_only_wi8_afp32" \
--aot_backend=GOOGLE \
--aot_soc_model=Tensor_G5 \
--aot_compilation_config_dict='{"google_tensor_enable_large_model_support": True}'
For more information, see Compile models with Google Tensor SDK.
Qualcomm AI Runtime
Prerequisites: Follow the LiteRT Qualcomm Integration guide for SDK setup instructions and supported devices.
Example:
litert-torch export_hf \
--model=google/gemma-3-270m-it \
--output_dir=/tmp/gemma3-270m-qualcomm-sm8750 \
--split_cache \
--externalize_embedder \
--quantization_recipe='' \
--aot_backend=qualcomm \
--aot_soc_model=SM8750
MediaTek NeuroPilot
Prerequisites: Follow the LiteRT MediaTek Integration guide for SDK setup instructions and supported devices.
Example:
litert-torch export_hf \
--model=google/gemma-3-270m-it \
--output_dir=/tmp/gemma3-270m-mediatek-mt8189 \
--split_cache \
--externalize_embedder \
--aot_backend=mediatek \
--aot_soc_model=MT8189
Intel OpenVINO
Prerequisites: Follow the LiteRT Intel OpenVINO Integration guide for SDK setup instructions and supported devices.
Example:
litert-torch export_hf \
--model=google/gemma-3-270m-it \
--output_dir=/tmp/gemma3-270m-intel-openvino-ptl \
--split_cache \
--externalize_embedder \
--aot_backend=intel_openvino \
--aot_soc_model=PTL
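The Qualcomm, MediaTek, and Intel examples above share most of their flags, so the command line can be assembled from a small table when scripting exports for several targets. The sketch below is an assumption-laden convenience, not part of the tool: build_npu_export_command and NPU_TARGETS are hypothetical names, the backend and SoC values are copied from the examples above, and Google Tensor is left out because it needs additional flags.

```python
import shlex

# (backend -> SoC model) pairs copied from the examples above.
NPU_TARGETS = {
    "qualcomm": "SM8750",
    "mediatek": "MT8189",
    "intel_openvino": "PTL",
}


def build_npu_export_command(model, output_dir, backend, extra_flags=()):
    """Return the export_hf argument list for one NPU backend.

    `extra_flags` carries backend-specific additions, such as the empty
    quantization recipe used in the Qualcomm example.
    """
    command = [
        "litert-torch", "export_hf",
        f"--model={model}",
        f"--output_dir={output_dir}",
        "--split_cache",
        "--externalize_embedder",
        f"--aot_backend={backend}",
        f"--aot_soc_model={NPU_TARGETS[backend]}",
    ]
    command.extend(extra_flags)
    return command


if __name__ == "__main__":
    cmd = build_npu_export_command(
        "google/gemma-3-270m-it", "/tmp/gemma3-270m-mediatek-mt8189", "mediatek"
    )
    # Print a shell-quoted command line for inspection before running it.
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the relevant NPU SDK is installed; building a list rather than a string avoids shell-quoting problems with values such as an empty recipe.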
Re-author and Convert using LiteRT Torch Generative API
The LiteRT Torch Generative API also provides building blocks for building and optimizing custom PyTorch models, including but not limited to normalization layers, attention layers, and other basic modules. If your model is not covered by the LiteRT Torch Hugging Face Export extension, you can re-author it with these building blocks so that it is compatible with LiteRT and LiteRT-LM.
Model examples are available for LLMs, diffusion models, and ASR models; you can use them as a starting point for deploying your own model.
For more information, see the Generative Torch API GitHub repo.