The LLM Inference API lets you run large language models (LLMs) completely on-device for Web applications, which you can use to perform a wide range of tasks, such as generating text, retrieving information in natural language form, and summarizing documents. The task provides built-in support for multiple text-to-text large language models, so you can apply the latest on-device generative AI models to your Web apps.
To quickly add the LLM Inference API to your Web application, follow the Quickstart. For a basic example of a Web application running the LLM Inference API, see the sample application. For a more in-depth understanding of how the LLM Inference API works, refer to the configuration options, model conversion, and LoRA tuning sections.
You can see this task in action with the MediaPipe Studio demo. For more information about the capabilities, models, and configuration options of this task, see the Overview.
Quickstart
Use the following steps to add the LLM Inference API to your Web application. The LLM Inference API requires a web browser with WebGPU compatibility. For a full list of compatible browsers, see GPU browser compatibility.
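You can check for WebGPU support at runtime before initializing the task. The check below is a minimal sketch that uses the standard navigator.gpu property; it is not part of the LLM Inference API:
// navigator.gpu is only defined in browsers with WebGPU support.
if (!navigator.gpu) {
  throw new Error('WebGPU is not supported in this browser.');
}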
Add dependencies
The LLM Inference API uses the @mediapipe/tasks-genai package.
Install the required packages for local staging:
npm install @mediapipe/tasks-genai
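When bundling the package locally, you can import the classes used throughout this guide directly from it. A minimal sketch, assuming an ES-module bundler such as webpack or Vite:
// Import the task classes used in the initialization and inference steps below.
import {FilesetResolver, LlmInference} from '@mediapipe/tasks-genai';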
To deploy to a server, use a content delivery network (CDN) service like jsDelivr to add code directly to your HTML page:
<head>
  <script src="https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/genai_bundle.cjs"
    crossorigin="anonymous"></script>
</head>
Download a model
Download Gemma-2 2B in an 8-bit quantized format from Kaggle Models. For more information on the available models, see the Models documentation.
Store the model within your project directory:
<dev-project-root>/assets/gemma-2b-it-gpu-int8.bin
Specify the path of the model with the modelAssetPath parameter of the baseOptions object:
baseOptions: { modelAssetPath: `/assets/gemma-2b-it-gpu-int8.bin`}
Initialize the Task
Initialize the task with basic configuration options:
const genai = await FilesetResolver.forGenAiTasks(
  // path/to/wasm/root
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest/wasm"
);
llmInference = await LlmInference.createFromOptions(genai, {
  baseOptions: {
    modelAssetPath: '/assets/gemma-2b-it-gpu-int8.bin'
  },
  maxTokens: 1000,
  topK: 40,
  temperature: 0.8,
  randomSeed: 101
});
Run the Task
Use the generateResponse() function to trigger inferences.
const response = await llmInference.generateResponse(inputPrompt);
document.getElementById('output').textContent = response;
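Because generateResponse() returns a promise, call it from an async context. For example, the following sketch wires inference to a button click; the submit, prompt, and output element IDs are assumptions for illustration:
// Run inference when the user clicks a (hypothetical) submit button and
// display the full response once it is ready.
document.getElementById('submit').addEventListener('click', async () => {
  const inputPrompt = document.getElementById('prompt').value;
  const response = await llmInference.generateResponse(inputPrompt);
  document.getElementById('output').textContent = response;
});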
To stream the response, use the following:
llmInference.generateResponse(
  inputPrompt,
  (partialResult, done) => {
    document.getElementById('output').textContent += partialResult;
  });
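The callback receives each partial result along with a done flag that becomes true once generation finishes. For example, a minimal sketch that re-enables a hypothetical submit button when streaming completes:
// Append each partial result and re-enable the (assumed) submit button
// once the final chunk arrives.
llmInference.generateResponse(
  inputPrompt,
  (partialResult, done) => {
    document.getElementById('output').textContent += partialResult;
    if (done) {
      document.getElementById('submit').disabled = false;
    }
  });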
Sample application
The sample application is an example of a basic text generation app for Web, using the LLM Inference API. You can use the app as a starting point for your own Web app, or refer to it when modifying an existing app. The example code is hosted on GitHub.
Clone the git repository using the following command:
git clone https://github.com/google-ai-edge/mediapipe-samples
For more information, see the Setup Guide for Web.
Configuration options
Use the following configuration options to set up a Web app:
Option Name | Description | Value Range | Default Value |
---|---|---|---|
modelPath | The path to where the model is stored within the project directory. | PATH | N/A |
maxTokens | The maximum number of tokens (input tokens + output tokens) the model handles. | Integer | 512 |
topK | The number of tokens the model considers at each step of generation. Limits predictions to the top k most-probable tokens. | Integer | 40 |
temperature | The amount of randomness introduced during generation. A higher temperature results in more creativity in the generated text, while a lower temperature produces more predictable generation. | Float | 0.8 |
randomSeed | The random seed used during text generation. | Integer | 0 |
loraRanks | LoRA ranks to be used by the LoRA models during runtime. Note: this is only compatible with GPU models. | Integer array | N/A |
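These options are passed to LlmInference.createFromOptions alongside baseOptions, as shown in the initialization example above. As a sketch, the values below illustrate a more deterministic configuration; the numbers are only examples:
const llmInference = await LlmInference.createFromOptions(genai, {
  baseOptions: {
    modelAssetPath: '/assets/gemma-2b-it-gpu-int8.bin'
  },
  maxTokens: 512,    // cap on input + output tokens
  topK: 1,           // only consider the single most probable token
  temperature: 0.1,  // low temperature for more predictable output
  randomSeed: 0
});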
Model conversion
The LLM Inference API is compatible with the following types of models, some of which require model conversion. Use the table to identify the conversion method required for your model.
Models | Conversion method | Compatible platforms | File type |
---|---|---|---|
Gemma-3 1B | No conversion required | Android, web | .task |
Gemma 2B, Gemma 7B, Gemma-2 2B | No conversion required | Android, iOS, web | .bin |
Phi-2, StableLM, Falcon | MediaPipe conversion script | Android, iOS, web | .bin |
All PyTorch LLM models | AI Edge Torch Generative library | Android, iOS | .task |
To learn how you can convert other models, see the Model Conversion section.
LoRA customization
The LLM Inference API supports LoRA (Low-Rank Adaptation) tuning using the PEFT (Parameter-Efficient Fine-Tuning) library. LoRA tuning customizes the behavior of LLMs through a cost-effective training process, creating a small set of trainable weights based on new training data rather than retraining the entire model.
The LLM Inference API supports adding LoRA weights to attention layers of the Gemma-2 2B, Gemma 2B, and Phi-2 models. Download the model in the safetensors format.
The base model must be in the safetensors format in order to create LoRA weights. After LoRA training, you can convert the models into the FlatBuffers format to run on MediaPipe.
Prepare LoRA weights
Use the LoRA Methods guide from PEFT to train a fine-tuned LoRA model on your own dataset.
The LLM Inference API only supports LoRA on attention layers, so only specify the attention layers in LoraConfig:
# For Gemma
from peft import LoraConfig
config = LoraConfig(
    r=LORA_RANK,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)

# For Phi-2
config = LoraConfig(
    r=LORA_RANK,
    target_modules=["q_proj", "v_proj", "k_proj", "dense"],
)
After training on the prepared dataset and saving the model, the fine-tuned LoRA model weights are available in adapter_model.safetensors. The safetensors file is the LoRA checkpoint used during model conversion.
Model conversion
Use the MediaPipe Python Package to convert the model weights into the FlatBuffers format. The ConversionConfig specifies the base model options along with the additional LoRA options.
import mediapipe as mp
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
    # Other params related to base model
    ...
    # Must use gpu backend for LoRA conversion
    backend='gpu',
    # LoRA related params
    lora_ckpt=LORA_CKPT,
    lora_rank=LORA_RANK,
    lora_output_tflite_file=LORA_OUTPUT_FILE,
)
converter.convert_checkpoint(config)
The converter will produce two MediaPipe-compatible files, one for the base model and another for the LoRA model.
LoRA model inference
Web supports dynamic LoRA at runtime: you declare the supported LoRA ranks during initialization, and you can then swap between different LoRA models at runtime.
const genai = await FilesetResolver.forGenAiTasks(
  // path/to/wasm/root
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest/wasm"
);
const llmInference = await LlmInference.createFromOptions(genai, {
  // options for the base model
  ...
  // LoRA ranks to be used by the LoRA models during runtime
  loraRanks: [4, 8, 16]
});
Load the LoRA models during runtime, after initializing the base model. Trigger the LoRA model by passing the model reference when generating the LLM response.
// Load several LoRA models. The returned LoRA model reference is used to
// specify which LoRA model to use for inference.
loraModelRank4 = await llmInference.loadLoraModel(loraModelRank4Url);
loraModelRank8 = await llmInference.loadLoraModel(loraModelRank8Url);

// Specify the LoRA model to use during inference
llmInference.generateResponse(
  inputPrompt,
  loraModelRank4,
  (partialResult, done) => {
    document.getElementById('output').textContent += partialResult;
  });