The LLM Inference API lets you run large language models (LLMs) completely on-device, which you can use to perform a wide range of tasks, such as generating text, retrieving information in natural language form, and summarizing documents. The task provides built-in support for multiple text-to-text large language models, so you can apply the latest on-device generative AI models to your apps and products.
The task supports the following variants of Gemma: Gemma-2 2B, Gemma 2B, and Gemma 7B. Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. It also supports the following external models: Phi-2, Falcon-RW-1B and StableLM-3B.
In addition to the supported models, you can use Google's AI Edge Torch to export PyTorch models into multi-signature LiteRT (tflite) models, which are bundled with tokenizer parameters to create Task Bundles that are compatible with the LLM Inference API. Models converted with AI Edge Torch can only run on the CPU backend and are therefore limited to Android and iOS.
Get Started
Start using this task by following one of these implementation guides for your target platform. These platform-specific guides walk you through a basic implementation of this task, with code examples that use an available model and the recommended configuration options:
- Web
- Android
- iOS
Task details
This section describes the capabilities, inputs, outputs, and configuration options of this task.
Features
The LLM Inference API contains the following key features:
- Text-to-text generation - Generate text based on an input text prompt.
- LLM selection - Apply multiple models to tailor the app for your specific use cases. You can also retrain and apply customized weights to the model.
- LoRA support - Extend and customize the LLM capability with LoRA models, either by training on your own dataset or by using prebuilt LoRA models from the open-source community (not compatible with models converted with the AI Edge Torch Generative API).
Task inputs | Task outputs |
---|---|
The LLM Inference API accepts the following inputs: a text prompt (for example, a question, an email subject, or a document to summarize). | The LLM Inference API outputs the following results: generated text based on the input prompt (for example, an answer to the question, a draft of the email, or a summary of the document). |
Configuration options
This task has the following configuration options:
Option Name | Description | Value Range | Default Value |
---|---|---|---|
`modelPath` | The path to where the model is stored within the project directory. | PATH | N/A |
`maxTokens` | The maximum number of tokens (input tokens + output tokens) the model handles. | Integer | 512 |
`topK` | The number of tokens the model considers at each step of generation. Limits predictions to the top k most-probable tokens. | Integer | 40 |
`temperature` | The amount of randomness introduced during generation. A higher temperature results in more creativity in the generated text, while a lower temperature produces more predictable generation. | Float | 0.8 |
`randomSeed` | The random seed used during text generation. | Integer | 0 |
`loraPath` | The absolute path to the LoRA model locally on the device. Note: this is only compatible with GPU models. | PATH | N/A |
`resultListener` | Sets the result listener to receive the results asynchronously. Only applicable when using the async generation method. | N/A | N/A |
`errorListener` | Sets an optional error listener. | N/A | N/A |
Models
The LLM Inference API supports many text-to-text large language models, including built-in support for several models that are optimized to run on browsers and mobile devices. These lightweight models can be used to run inferences completely on-device.
Before initializing the LLM Inference API, download a model and store the file within your project directory. You can use a pre-converted model or convert a model to a MediaPipe-compatible format.
The LLM Inference API is compatible with two categories of models, some of which require model conversion. Use the table to identify the required conversion method for your model.
 | Models | Conversion method | Compatible platforms | File type |
---|---|---|---|---|
Supported models | Gemma 2B, Gemma 7B, Gemma-2 2B, Phi-2, StableLM, Falcon | MediaPipe | Android, iOS, web | .bin |
Other PyTorch models | All PyTorch LLM models | AI Edge Torch Generative library | Android, iOS | .task |
We are hosting the converted `.bin` files for Gemma 2B, Gemma 7B, and Gemma-2 2B on Kaggle. These models can be directly deployed using our LLM Inference API. To learn how you can convert other models, see the Model Conversion section.
Gemma-2 2B
Gemma-2 2B is the latest model in the Gemma family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. The model contains 2B parameters and open weights. Gemma-2 2B is known for state-of-the-art reasoning skills for models in its class.
The Gemma-2 2B models are available in the following variants:
- gemma2-2b-it-cpu-int8: Gemma-2 2B 8-bit model with CPU compatibility.
- gemma2-2b-it-gpu-int8: Gemma-2 2B 8-bit model with GPU compatibility.
You can also tune the model and add new weights before adding it to the app. For more information on tuning and customizing Gemma, see Tuning Gemma. After downloading Gemma-2 2B from Kaggle Models, the model is already in the appropriate format to use with MediaPipe Tasks.
Gemma 2B
Gemma 2B is a part of a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. The model contains 2B parameters and open weights. This model is well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning.
The Gemma 2B models are available in the following variants:
- gemma-2b-it-cpu-int4: Gemma 2B 4-bit model with CPU compatibility.
- gemma-2b-it-cpu-int8: Gemma 2B 8-bit model with CPU compatibility.
- gemma-2b-it-gpu-int4: Gemma 2B 4-bit model with GPU compatibility.
- gemma-2b-it-gpu-int8: Gemma 2B 8-bit model with GPU compatibility.
You can also tune the model and add new weights before adding it to the app. For more information on tuning and customizing Gemma, see Tuning Gemma. After downloading Gemma 2B from Kaggle Models, the model is already in the appropriate format to use with MediaPipe Tasks.
Gemma 7B
Gemma 7B is a larger Gemma model with 7B parameters and open weights. The model is more powerful for a variety of text generation tasks, including question answering, summarization, and reasoning. Gemma 7B is only supported on Web.
The Gemma 7B model comes in one variant:
- gemma-1.1-7b-it-gpu-int8: Gemma 7B 8-bit model with GPU compatibility.
After downloading Gemma 7B from Kaggle Models, the model is already in the appropriate format to use with MediaPipe.
Falcon 1B
Falcon-1B is a 1 billion parameter causal decoder-only model trained on 350B tokens of RefinedWeb.
The LLM Inference API requires the following files to be downloaded and stored locally:
- `tokenizer.json`
- `tokenizer_config.json`
- `pytorch_model.bin`
After downloading the Falcon model files, the model is ready to be converted to the MediaPipe format with a conversion script. Follow the steps in the Conversion script for supported models section.
StableLM 3B
StableLM-3B is a 3 billion parameter decoder-only language model pre-trained on 1 trillion tokens of diverse English and code datasets for 4 epochs.
The LLM Inference API requires the following files to be downloaded and stored locally:
- `tokenizer.json`
- `tokenizer_config.json`
- `model.safetensors`
After downloading the StableLM model files, the model is ready to be converted to the MediaPipe format with a conversion script. Follow the steps in the Conversion script for supported models section.
Phi-2
Phi-2 is a 2.7 billion parameter Transformer model. It was trained using various NLP synthetic texts and filtered websites. The model is best suited for prompts using the question-answer, chat, and code formats.
The LLM Inference API requires the following files to be downloaded and stored locally:
- `tokenizer.json`
- `tokenizer_config.json`
- `model-00001-of-00002.safetensors`
- `model-00002-of-00002.safetensors`
After downloading the Phi-2 model files, the model is ready to be converted to the MediaPipe format with a conversion script. Follow the steps in the Conversion script for supported models section.
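If you prefer to fetch these files programmatically, the sketch below uses the `huggingface_hub` package. The `microsoft/phi-2` repository id and the local `phi-2` directory are assumptions for illustration, not part of this guide.

```python
# Minimal sketch: download the Phi-2 files listed above from Hugging Face.
# The repo id and local directory are assumptions; adjust for your setup.
from huggingface_hub import hf_hub_download

FILES = [
    "tokenizer.json",
    "tokenizer_config.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
]

for filename in FILES:
    # Store each file in ./phi-2 so the conversion script can find it locally.
    hf_hub_download(repo_id="microsoft/phi-2", filename=filename, local_dir="phi-2")
```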
Generative PyTorch Models
PyTorch generative models can be converted to a MediaPipe-compatible format with the AI Edge Torch Generative API. You can use the API to convert PyTorch models into multi-signature LiteRT (TensorFlow Lite) models. For more details on mapping and exporting models, visit the AI Edge Torch GitHub page.
If you intend to use the AI Edge Torch Generative API to convert a PyTorch model, follow the steps in the Torch Generative converter for PyTorch models section.
Model conversion
The MediaPipe LLM Inference API lets you run a wide variety of large language models on-device. This includes models that have been pre-converted to a MediaPipe-compatible format, as well as other models that can be converted with a conversion script or the AI Edge Torch library.
The LLM Inference API accepts models in the `.bin` and `.task` file formats. Pre-converted models and models converted with the conversion script will be `.bin` files, while models converted with the AI Edge Torch library will be `.task` files. Do not manually alter the file formats of your converted models.
The LLM Inference API contains three model conversion paths:
- Pre-converted models (Gemma 2B, Gemma 7B, Gemma-2 2B): No conversion required.
- Supported models (Phi-2, StableLM, Falcon): MediaPipe conversion script.
- Other PyTorch models (All PyTorch LLM models): AI Edge Torch Generative API.
Pre-converted models
The Gemma-2 2B, Gemma 2B and Gemma 7B models are available as pre-converted models in the MediaPipe format. These models do not require any additional conversion steps from the user and can be run as-is with the LLM Inference API.
You can download the Gemma-2 2B model from Kaggle Models:
- gemma2-2b-it-cpu-int8: Gemma-2 2B 8-bit model with CPU compatibility.
- gemma2-2b-it-gpu-int8: Gemma-2 2B 8-bit model with GPU compatibility.
You can download variants of Gemma 2B from Kaggle Models:
- gemma-2b-it-cpu-int4: Gemma 2B 4-bit model with CPU compatibility.
- gemma-2b-it-cpu-int8: Gemma 2B 8-bit model with CPU compatibility.
- gemma-2b-it-gpu-int4: Gemma 2B 4-bit model with GPU compatibility.
- gemma-2b-it-gpu-int8: Gemma 2B 8-bit model with GPU compatibility.
You can download the Gemma 7B model from Kaggle Models:
- gemma-1.1-7b-it-gpu-int8: Gemma 7B 8-bit model with GPU compatibility.
For more information on the Gemma models, see the documentation on Gemma-2 2B, Gemma 2B and Gemma 7B.
Conversion script for supported models
The MediaPipe package offers a conversion script to convert the following external models into a MediaPipe-compatible format:
- Falcon 1B
- StableLM 3B
- Phi-2
For more information on the supported external models, see the documentation on Falcon 1B, StableLM 3B, and Phi-2.
The model conversion process requires the MediaPipe PyPI package. The conversion script is available in all MediaPipe packages after `0.10.11`.
Install and import the dependencies with the following:
```
$ python3 -m pip install mediapipe
```
Use the `genai.converter` library to convert the model:
```python
import mediapipe as mp
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
    input_ckpt=INPUT_CKPT,
    ckpt_format=CKPT_FORMAT,
    model_type=MODEL_TYPE,
    backend=BACKEND,
    output_dir=OUTPUT_DIR,
    combine_file_only=False,
    vocab_model_file=VOCAB_MODEL_FILE,
    output_tflite_file=OUTPUT_TFLITE_FILE,
)
converter.convert_checkpoint(config)
```
To convert the LoRA model, the `ConversionConfig` should specify the base model options as well as additional LoRA options. Notice that since the API only supports LoRA inference with GPU, the backend must be set to `'gpu'`.
```python
import mediapipe as mp
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
    # Other params related to base model
    ...
    # Must use gpu backend for LoRA conversion
    backend='gpu',
    # LoRA related params
    lora_ckpt=LORA_CKPT,
    lora_rank=LORA_RANK,
    lora_output_tflite_file=LORA_OUTPUT_TFLITE_FILE,
)
converter.convert_checkpoint(config)
```
The converter will output two TFLite flatbuffer files, one for the base model and the other for the LoRA model.
Parameter | Description | Accepted Values |
---|---|---|
`input_ckpt` | The path to the `model.safetensors` or `pytorch.bin` file. Note that sometimes the model safetensors format is sharded into multiple files, e.g. `model-00001-of-00003.safetensors`, `model-00002-of-00003.safetensors`. You can specify a file pattern, like `model*.safetensors`. | PATH |
`ckpt_format` | The model file format. | {"safetensors", "pytorch"} |
`model_type` | The LLM being converted. | {"PHI_2", "FALCON_RW_1B", "STABLELM_4E1T_3B", "GEMMA_2B"} |
`backend` | The processor (delegate) used to run the model. | {"cpu", "gpu"} |
`output_dir` | The path to the output directory that hosts the per-layer weight files. | PATH |
`output_tflite_file` | The path to the output file. For example, "model_cpu.bin" or "model_gpu.bin". This file is only compatible with the LLM Inference API, and cannot be used as a general `tflite` file. | PATH |
`vocab_model_file` | The path to the directory that stores the `tokenizer.json` and `tokenizer_config.json` files. For Gemma, point to the single `tokenizer.model` file. | PATH |
`lora_ckpt` | The path to the LoRA checkpoint safetensors file that stores the LoRA adapter weights. | PATH |
`lora_rank` | An integer representing the rank of the LoRA checkpoint. Required in order to convert the LoRA weights. If not provided, the converter assumes there are no LoRA weights. Note: Only the GPU backend supports LoRA. | Integer |
`lora_output_tflite_file` | The output tflite filename for the LoRA weights. | PATH |
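As a concrete illustration, a filled-in configuration for converting a Falcon-RW-1B checkpoint to a CPU model might look like the following sketch; all file and directory paths are hypothetical placeholders for your own locations.

```python
import mediapipe as mp
from mediapipe.tasks.python.genai import converter

# Hypothetical paths; replace them with the locations of your downloaded Falcon files.
config = converter.ConversionConfig(
    input_ckpt='falcon-rw-1b/pytorch_model.bin',   # downloaded PyTorch checkpoint
    ckpt_format='pytorch',                         # Falcon ships a pytorch_model.bin file
    model_type='FALCON_RW_1B',
    backend='cpu',
    output_dir='falcon-rw-1b/intermediate',        # per-layer weight files are written here
    combine_file_only=False,
    vocab_model_file='falcon-rw-1b/',              # directory with tokenizer.json / tokenizer_config.json
    output_tflite_file='falcon_rw_1b_cpu.bin',     # final file consumed by the LLM Inference API
)
converter.convert_checkpoint(config)
```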
Torch Generative converter for PyTorch models
PyTorch generative models can be converted to a MediaPipe-compatible format with the AI Edge Torch Generative API. You can use the API to author, convert, and quantize PyTorch LLMs to use with the LLM Inference API. The Torch Generative converter only converts for CPU and requires a Linux machine with at least 64 GB of RAM.
Converting a PyTorch model with the AI Edge Torch Generative API involves the following:
- Download the PyTorch model checkpoints.
- Use the AI Edge Torch Generative API to author, convert, and quantize the model to a MediaPipe-compatible file format (`.tflite`).
- Create a Task Bundle (`.task`) from the tflite file and the model tokenizer.
Create the Task Bundle using the bundling script. The bundling process packs the mapped model with additional metadata (e.g., tokenizer parameters) needed to run end-to-end inference.
The model bundling process requires the MediaPipe PyPI package. The conversion script is available in all MediaPipe packages after `0.10.14`.
Install and import the dependencies with the following:
```
$ python3 -m pip install mediapipe
```
Use the `genai.bundler` library to bundle the model:
```python
import mediapipe as mp
from mediapipe.tasks.python.genai import bundler

config = bundler.BundleConfig(
    tflite_model=TFLITE_MODEL,
    tokenizer_model=TOKENIZER_MODEL,
    start_token=START_TOKEN,
    stop_tokens=STOP_TOKENS,
    output_filename=OUTPUT_FILENAME,
    enable_bytes_to_unicode_mapping=ENABLE_BYTES_TO_UNICODE_MAPPING,
)
bundler.create_bundle(config)
```
Parameter | Description | Accepted Values |
---|---|---|
`tflite_model` | The path to the AI Edge exported TFLite model. | PATH |
`tokenizer_model` | The path to the SentencePiece tokenizer model. | PATH |
`start_token` | Model specific start token. The start token must be present in the provided tokenizer model. | STRING |
`stop_tokens` | Model specific stop tokens. The stop tokens must be present in the provided tokenizer model. | LIST[STRING] |
`output_filename` | The name of the output task bundle file. | PATH |
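For example, a filled-in bundle configuration might look like the sketch below. The file names are placeholders, and the `<bos>`/`<eos>` start and stop tokens are assumptions that must match the tokenizer of the model you exported.

```python
import mediapipe as mp
from mediapipe.tasks.python.genai import bundler

# Hypothetical file names; start/stop tokens must match your model's tokenizer.
config = bundler.BundleConfig(
    tflite_model='my_model.tflite',        # model exported with AI Edge Torch
    tokenizer_model='tokenizer.model',     # SentencePiece tokenizer for the model
    start_token='<bos>',
    stop_tokens=['<eos>'],
    output_filename='my_model.task',       # Task Bundle consumed by the LLM Inference API
    enable_bytes_to_unicode_mapping=False,
)
bundler.create_bundle(config)
```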
LoRA customization
The MediaPipe LLM Inference API can be configured to support Low-Rank Adaptation (LoRA) for large language models. With fine-tuned LoRA models, developers can customize the behavior of LLMs through a cost-effective training process.

LoRA support in the LLM Inference API works for all Gemma variants and Phi-2 models on the GPU backend, with LoRA weights applicable to attention layers only. This initial implementation serves as an experimental API for future developments, with plans to support more models and various types of layers in coming updates.
Prepare LoRA models
Follow the instructions on HuggingFace to train a fine-tuned LoRA model on your own dataset with a supported model type, Gemma or Phi-2. The Gemma-2 2B, Gemma 2B, and Phi-2 models are all available on HuggingFace in the safetensors format. Since the LLM Inference API only supports LoRA on attention layers, specify only attention layers when creating the `LoraConfig`, as follows:
```python
# For Gemma
from peft import LoraConfig

config = LoraConfig(
    r=LORA_RANK,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)

# For Phi-2
config = LoraConfig(
    r=LORA_RANK,
    target_modules=["q_proj", "v_proj", "k_proj", "dense"],
)
```
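For context, the following is a minimal fine-tuning sketch with Hugging Face `transformers` and `peft` that attaches the Gemma LoRA configuration above to a base model and saves the adapter weights. The base model id, LoRA rank, and output directory are assumptions, and the training loop itself is omitted.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical base model id and LoRA rank; adjust for your setup.
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
config = LoraConfig(
    r=16,  # LORA_RANK
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention layers only
)
peft_model = get_peft_model(base_model, config)

# ... fine-tune peft_model on your dataset (e.g., with transformers.Trainer) ...

# Saving the adapter writes adapter_model.safetensors, which is later used
# as the LoRA checkpoint (lora_ckpt) in the conversion step.
peft_model.save_pretrained("gemma-2b-lora-adapter")
```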
For testing, publicly accessible fine-tuned LoRA models that work with the LLM Inference API are available on HuggingFace. For example, monsterapi/gemma-2b-lora-maths-orca-200k for Gemma-2B and lole25/phi-2-sft-ultrachat-lora for Phi-2.
After training on the prepared dataset and saving the model, you obtain an `adapter_model.safetensors` file containing the fine-tuned LoRA model weights. The safetensors file is the LoRA checkpoint used in the model conversion.

As the next step, you need to convert the model weights into a TensorFlow Lite Flatbuffer using the MediaPipe Python package. The `ConversionConfig` should specify the base model options as well as additional LoRA options. Notice that since the API only supports LoRA inference with GPU, the backend must be set to `'gpu'`.
```python
import mediapipe as mp
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
    # Other params related to base model
    ...
    # Must use gpu backend for LoRA conversion
    backend='gpu',
    # LoRA related params
    lora_ckpt=LORA_CKPT,
    lora_rank=LORA_RANK,
    lora_output_tflite_file=LORA_OUTPUT_TFLITE_FILE,
)
converter.convert_checkpoint(config)
```
The converter will output two TFLite flatbuffer files, one for the base model and the other for the LoRA model.
LoRA model inference
The Web, Android, and iOS LLM Inference APIs have been updated to support LoRA model inference.
Android supports static LoRA during initialization. To load a LoRA model, users specify the LoRA model path as well as the base LLM.

```kotlin
// Set the configuration options for the LLM Inference task
val options = LlmInferenceOptions.builder()
    .setModelPath("<path to base model>")
    .setMaxTokens(1000)
    .setTopK(40)
    .setTemperature(0.8f)
    .setRandomSeed(101)
    .setLoraPath("<path to LoRA model>")
    .build()

// Create an instance of the LLM Inference task
llmInference = LlmInference.createFromOptions(context, options)
```
To run LLM inference with LoRA, use the same `generateResponse()` or `generateResponseAsync()` methods as the base model.