The LLM Inference API lets you run large language models (LLMs) entirely in the browser for web applications. You can use it to perform a wide range of tasks, such as generating text, retrieving information in natural language form, and summarizing documents. The task provides built-in support for multiple text-to-text large language models, so you can apply the latest on-device generative AI models to your web apps.
You can see this task in action with the MediaPipe Studio demo. For more information about the capabilities, models, and configuration options of this task, see the Overview.
Code example
The example application for the LLM Inference API provides a basic implementation of this task in JavaScript for your reference. You can use this sample app to get started building your own text generation app.
You can access the LLM Inference API example app on GitHub.
Setup
This section describes the key steps for setting up your development environment and code projects to use the LLM Inference API. For general information on setting up your development environment for using MediaPipe Tasks, including platform version requirements, see the Setup guide for Web.
Browser compatibility
The LLM Inference API requires a web browser with WebGPU compatibility. For a full list of compatible browsers, see GPU browser compatibility.
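Because the task depends on WebGPU, a page can check for it before downloading a large model file. The following is a minimal sketch, assuming the standard WebGPU entry point `navigator.gpu` and its `requestAdapter()` method. The helper name `hasWebGpu` is hypothetical and is not part of MediaPipe.

```javascript
// Hypothetical helper (not part of MediaPipe): reports whether a
// navigator-like object can provide a WebGPU adapter.
async function hasWebGpu(nav) {
  // Browsers without WebGPU support do not define navigator.gpu.
  if (!nav || !nav.gpu || typeof nav.gpu.requestAdapter !== 'function') {
    return false;
  }
  try {
    // requestAdapter() resolves to null when no suitable GPU is available.
    const adapter = await nav.gpu.requestAdapter();
    return Boolean(adapter);
  } catch (err) {
    // Treat any failure while probing as "not available".
    return false;
  }
}
```

In a page, calling `hasWebGpu(navigator)` and showing a fallback message when it resolves to `false` avoids starting a model download that cannot be used.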
JavaScript packages
LLM Inference API code is available through the @mediapipe/tasks-genai package. You can find and download these libraries from links provided in the platform Setup guide.
Install the required packages for local staging:
npm install @mediapipe/tasks-genai
To deploy to a server, use a content delivery network (CDN) service like jsDelivr to add code directly to your HTML page:
<head>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/genai_bundle.cjs"
crossorigin="anonymous"></script>
</head>
Model
The MediaPipe LLM Inference API requires a trained model that is compatible with this task. For web applications, the model must be GPU-compatible.
For more information on available trained models for LLM Inference API, see the task overview Models section.
Download a model
Before initializing the LLM Inference API, download one of the supported models and store the file within your project directory:
- Gemma-2 2B: The latest version of the Gemma family of models. Part of a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.
- Gemma 2B: Part of a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning.
- Phi-2: 2.7 billion parameter Transformer model, best suited for question-answer, chat, and code formats.
- Falcon-RW-1B: 1 billion parameter causal decoder-only model trained on 350B tokens of RefinedWeb.
- StableLM-3B: 3 billion parameter decoder-only language model pre-trained on 1 trillion tokens of diverse English and code datasets.
We recommend using Gemma-2 2B, which is available on Kaggle Models. For more information on the other available models, see the task overview Models section.
Convert model to MediaPipe format
Native model conversion
If you are using an external LLM (Phi-2, Falcon, or StableLM) or a non-Kaggle version of Gemma, use our conversion scripts to format the model to be compatible with MediaPipe.
The model conversion process requires the MediaPipe PyPI package. The conversion script is available in all MediaPipe packages after 0.10.11.
Install and import the dependencies with the following:
$ python3 -m pip install mediapipe
Use the genai.converter library to convert the model:
import mediapipe as mp
from mediapipe.tasks.python.genai import converter
config = converter.ConversionConfig(
    input_ckpt=INPUT_CKPT,
    ckpt_format=CKPT_FORMAT,
    model_type=MODEL_TYPE,
    backend=BACKEND,
    output_dir=OUTPUT_DIR,
    combine_file_only=False,
    vocab_model_file=VOCAB_MODEL_FILE,
    output_tflite_file=OUTPUT_TFLITE_FILE,
)
converter.convert_checkpoint(config)
To convert the LoRA model, the ConversionConfig should specify the base model options as well as additional LoRA options. Because the API only supports LoRA inference with GPU, the backend must be set to 'gpu'.
import mediapipe as mp
from mediapipe.tasks.python.genai import converter
config = converter.ConversionConfig(
    # Other params related to base model
    ...
    # Must use gpu backend for LoRA conversion
    backend='gpu',
    # LoRA related params
    lora_ckpt=LORA_CKPT,
    lora_rank=LORA_RANK,
    lora_output_tflite_file=LORA_OUTPUT_TFLITE_FILE,
)
converter.convert_checkpoint(config)
The converter will output two TFLite flatbuffer files, one for the base model and the other for the LoRA model.
Parameter | Description | Accepted Values |
---|---|---|
input_ckpt | The path to the model.safetensors or pytorch.bin file. Note that the model safetensors format is sometimes sharded into multiple files, e.g. model-00001-of-00003.safetensors, model-00002-of-00003.safetensors. You can specify a file pattern, like model*.safetensors. | PATH |
ckpt_format | The model file format. | {"safetensors", "pytorch"} |
model_type | The LLM being converted. | {"PHI_2", "FALCON_RW_1B", "STABLELM_4E1T_3B", "GEMMA_2B", "GEMMA_7B", "GEMMA-2_2B"} |
backend | The processor (delegate) used to run the model. | {"cpu", "gpu"} |
output_dir | The path to the output directory that hosts the per-layer weight files. | PATH |
output_tflite_file | The path to the output file. For example, "model_cpu.bin" or "model_gpu.bin". This file is only compatible with the LLM Inference API, and cannot be used as a general `tflite` file. | PATH |
vocab_model_file | The path to the directory that stores the tokenizer.json and tokenizer_config.json files. For Gemma, point to the single tokenizer.model file. | PATH |
lora_ckpt | The path to the LoRA checkpoint, a safetensors file that stores the LoRA adapter weights. | PATH |
lora_rank | An integer representing the rank of the LoRA checkpoint. Required in order to convert the LoRA weights. If not provided, the converter assumes there are no LoRA weights. Note: Only the GPU backend supports LoRA. | Integer |
lora_output_tflite_file | Output tflite filename for the LoRA weights. | PATH |
AI Edge model conversion
If you are using an LLM mapped to a TFLite model through AI Edge, use our bundling script to create a Task Bundle. The bundling process packs the mapped model with additional metadata (e.g., Tokenizer Parameters) needed to run end-to-end inference.
The model bundling process requires the MediaPipe PyPI package. The bundling script is available in all MediaPipe packages after 0.10.14.
Install and import the dependencies with the following:
$ python3 -m pip install mediapipe
Use the genai.bundler library to bundle the model:
import mediapipe as mp
from mediapipe.tasks.python.genai import bundler
config = bundler.BundleConfig(
    tflite_model=TFLITE_MODEL,
    tokenizer_model=TOKENIZER_MODEL,
    start_token=START_TOKEN,
    stop_tokens=STOP_TOKENS,
    output_filename=OUTPUT_FILENAME,
    enable_bytes_to_unicode_mapping=ENABLE_BYTES_TO_UNICODE_MAPPING,
)
bundler.create_bundle(config)
Parameter | Description | Accepted Values |
---|---|---|
tflite_model | The path to the AI Edge exported TFLite model. | PATH |
tokenizer_model | The path to the SentencePiece tokenizer model. | PATH |
start_token | Model specific start token. The start token must be present in the provided tokenizer model. | STRING |
stop_tokens | Model specific stop tokens. The stop tokens must be present in the provided tokenizer model. | LIST[STRING] |
output_filename | The name of the output task bundle file. | PATH |
Add model to project directory
Store the model within your project directory:
<dev-project-root>/assets/gemma-2b-it-gpu-int4.bin
Specify the path of the model with the baseOptions object modelAssetPath parameter:
baseOptions: { modelAssetPath: `/assets/gemma-2b-it-gpu-int4.bin`}
Create the task
Use one of the LLM Inference API createFrom...() functions to prepare the task for running inferences. You can use the createFromModelPath() function with a relative or absolute path to the trained model file. The code example uses the createFromOptions() function. For more information on the available configuration options, see Configuration options.
The following code demonstrates how to build and configure this task:
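// Assumes the npm package installed in the Setup section.
import {FilesetResolver, LlmInference} from '@mediapipe/tasks-genai';

// Resolve the WebAssembly files that back the GenAI tasks.
const genai = await FilesetResolver.forGenAiTasks(
    // path/to/wasm/root
    "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest/wasm"
);
// Create the task with the model path and sampling options.
const llmInference = await LlmInference.createFromOptions(genai, {
    baseOptions: {
        modelAssetPath: '/assets/gemma-2b-it-gpu-int4.bin'
    },
    maxTokens: 1000,
    topK: 40,
    temperature: 0.8,
    randomSeed: 101
});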
Configuration options
This task has the following configuration options for Web and JavaScript apps:
Option Name | Description | Value Range | Default Value |
---|---|---|---|
modelPath | The path to where the model is stored within the project directory. | PATH | N/A |
maxTokens | The maximum number of tokens (input tokens + output tokens) the model handles. | Integer | 512 |
topK | The number of tokens the model considers at each step of generation. Limits predictions to the top k most-probable tokens. | Integer | 40 |
temperature | The amount of randomness introduced during generation. A higher temperature results in more creativity in the generated text, while a lower temperature produces more predictable generation. | Float | 0.8 |
randomSeed | The random seed used during text generation. | Integer | 0 |
loraRanks | LoRA ranks to be used by the LoRA models during runtime. Note: this is only compatible with GPU models. | Integer array | N/A |
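The defaults in the table can be captured in one place so that an app overrides only what it needs. The sketch below is one possible way to do that. `DEFAULT_LLM_OPTIONS` and `buildLlmOptions` are hypothetical names, not part of the MediaPipe API, and the type checks mirror only the "Value Range" column above.

```javascript
// Defaults copied from the configuration options table.
const DEFAULT_LLM_OPTIONS = Object.freeze({
  maxTokens: 512,
  topK: 40,
  temperature: 0.8,
  randomSeed: 0,
});

// Hypothetical helper: merges overrides with the defaults and rejects
// values whose type does not match the table.
function buildLlmOptions(modelAssetPath, overrides = {}) {
  if (typeof modelAssetPath !== 'string' || modelAssetPath.length === 0) {
    throw new TypeError('modelAssetPath must be a non-empty string');
  }
  const merged = { ...DEFAULT_LLM_OPTIONS, ...overrides };
  for (const key of ['maxTokens', 'topK', 'randomSeed']) {
    if (!Number.isInteger(merged[key])) {
      throw new TypeError(`${key} must be an integer`);
    }
  }
  if (typeof merged.temperature !== 'number' || Number.isNaN(merged.temperature)) {
    throw new TypeError('temperature must be a number');
  }
  // Shape matches the object passed to createFromOptions() in this guide.
  return { baseOptions: { modelAssetPath }, ...merged };
}
```

The returned object can be passed as the second argument to `createFromOptions()`, for example `buildLlmOptions('/assets/gemma-2b-it-gpu-int4.bin', { maxTokens: 1000 })`.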
Prepare data
The LLM Inference API accepts text (string) data. The task handles the data input preprocessing, including tokenization and tensor preprocessing.
All preprocessing is handled within the generateResponse() function. There is no need for additional preprocessing of the input text.
const inputPrompt = "Compose an email to remind Brett of lunch plans at noon on Saturday.";
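Because the maxTokens option covers input and output tokens together, a longer prompt leaves less room for the response. The arithmetic is sketched below. `outputTokenBudget` is a hypothetical helper, and `promptTokens` is assumed to come from whatever token count is available for the prompt; how that count is obtained is outside this sketch.

```javascript
// Hypothetical helper: number of tokens left for the response once the
// prompt has been counted against maxTokens (input + output).
function outputTokenBudget(maxTokens, promptTokens) {
  if (!Number.isInteger(maxTokens) || !Number.isInteger(promptTokens) ||
      maxTokens < 0 || promptTokens < 0) {
    throw new RangeError('token counts must be non-negative integers');
  }
  // Never report a negative budget when the prompt alone exceeds the limit.
  return Math.max(0, maxTokens - promptTokens);
}
```

For example, with the default maxTokens of 512 and a 100-token prompt, `outputTokenBudget(512, 100)` returns 412.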
Run the task
The LLM Inference API uses the generateResponse() function to trigger inferences. The function takes the input prompt and returns the generated response text.
The following code demonstrates how to execute the processing with the task model.
const response = await llmInference.generateResponse(inputPrompt);
document.getElementById('output').textContent = response;
To stream the response, use the following:
llmInference.generateResponse(
    inputPrompt,
    (partialResult, done) => {
        document.getElementById('output').textContent += partialResult;
    });
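When streaming, it can help to keep the accumulated text and the completion flag separate from the DOM update. The following is a small sketch, assuming the `(partialResult, done)` callback shape shown above. `createStreamCollector` is a hypothetical helper, not part of the API.

```javascript
// Hypothetical helper: accumulates streamed chunks and tracks completion.
// onUpdate (optional) is called with the full text so far and the done flag.
function createStreamCollector(onUpdate) {
  let text = '';
  let finished = false;
  return {
    // Pass this function as the streaming callback.
    listener(partialResult, done) {
      text += partialResult;
      if (done) {
        finished = true;
      }
      if (typeof onUpdate === 'function') {
        onUpdate(text, finished);
      }
    },
    get text() {
      return text;
    },
    get finished() {
      return finished;
    },
  };
}
```

A page could create a collector with `createStreamCollector((text) => { output.textContent = text; })`, where `output` is the target element, and pass `collector.listener` as the second argument to `generateResponse()`.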
Handle and display results
The LLM Inference API returns a string, which includes the generated response text.
Here's a draft you can use:
Subject: Lunch on Saturday Reminder
Hi Brett,
Just a quick reminder about our lunch plans this Saturday at noon.
Let me know if that still works for you.
Looking forward to it!
Best,
[Your Name]
LoRA model customization
The MediaPipe LLM Inference API can be configured to support Low-Rank Adaptation (LoRA) for large language models. Using fine-tuned LoRA models, developers can customize the behavior of LLMs through a cost-effective training process.
LoRA support in the LLM Inference API works for Gemma-2B and Phi-2 models on the GPU backend, with LoRA weights applicable to attention layers only. This initial implementation serves as an experimental API for future development, with plans to support more models and additional layer types in upcoming updates.
Prepare LoRA models
Follow the instructions on HuggingFace to train a fine-tuned LoRA model on your own dataset with one of the supported model types, Gemma-2B or Phi-2. Gemma-2B and Phi-2 models are both available on HuggingFace in the safetensors format. Since the LLM Inference API only supports LoRA on attention layers, specify only attention layers when creating the LoraConfig, as follows:
# For Gemma-2B
from peft import LoraConfig
config = LoraConfig(
    r=LORA_RANK,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)
# For Phi-2
config = LoraConfig(
    r=LORA_RANK,
    target_modules=["q_proj", "v_proj", "k_proj", "dense"],
)
For testing, publicly accessible fine-tuned LoRA models that are compatible with the LLM Inference API are available on HuggingFace. For example, monsterapi/gemma-2b-lora-maths-orca-200k for Gemma-2B and lole25/phi-2-sft-ultrachat-lora for Phi-2.
After training on the prepared dataset and saving the model, you obtain an adapter_model.safetensors file containing the fine-tuned LoRA model weights. The safetensors file is the LoRA checkpoint used in the model conversion.
As the next step, you need to convert the model weights into a TensorFlow Lite Flatbuffer using the MediaPipe Python Package. The ConversionConfig should specify the base model options as well as additional LoRA options. Because the API only supports LoRA inference with GPU, the backend must be set to 'gpu'.
import mediapipe as mp
from mediapipe.tasks.python.genai import converter
config = converter.ConversionConfig(
    # Other params related to base model
    ...
    # Must use gpu backend for LoRA conversion
    backend='gpu',
    # LoRA related params
    lora_ckpt=LORA_CKPT,
    lora_rank=LORA_RANK,
    lora_output_tflite_file=LORA_OUTPUT_TFLITE_FILE,
)
converter.convert_checkpoint(config)
The converter will output two TFLite flatbuffer files, one for the base model and the other for the LoRA model.
LoRA model inference
The Web, Android and iOS LLM Inference API are updated to support LoRA model inference. Web supports dynamic LoRA, which can switch different LoRA models during runtime. Android and iOS support static LoRA, which uses the same LoRA weights during the lifetime of the task.
Web supports dynamic LoRA during runtime. That is, users declare the LoRA ranks to be used during initialization, and can swap different LoRA models during runtime.
const genai = await FilesetResolver.forGenAiTasks(
    // path/to/wasm/root
    "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest/wasm"
);
const llmInference = await LlmInference.createFromOptions(genai, {
    // options for the base model
    ...
    // LoRA ranks to be used by the LoRA models during runtime
    loraRanks: [4, 8, 16]
});
During runtime, after the base model is initialized, load the LoRA models to be used. Also, trigger the LoRA model by passing the LoRA model reference while generating the LLM response.
// Load several LoRA models. The returned LoRA model reference is used to specify
// which LoRA model to be used for inference.
const loraModelRank4 = await llmInference.loadLoraModel(loraModelRank4Url);
const loraModelRank8 = await llmInference.loadLoraModel(loraModelRank8Url);
// Specify LoRA model to be used during inference
llmInference.generateResponse(
    inputPrompt,
    loraModelRank4,
    (partialResult, done) => {
        document.getElementById('output').textContent += partialResult;
    });
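When an app switches between several LoRA models, it may want to load each one only once. The sketch below caches the references returned by `loadLoraModel()` by URL. It assumes only what the snippet above shows, namely that `loadLoraModel(url)` returns a promise for a model reference. `createLoraCache` is a hypothetical helper, not part of the API.

```javascript
// Hypothetical helper: caches LoRA model references by URL so that each
// adapter is loaded once. `llm` is any object with a loadLoraModel(url)
// method that returns a promise.
function createLoraCache(llm) {
  const cache = new Map();
  return function getLoraModel(url) {
    if (!cache.has(url)) {
      // Store the promise itself so concurrent callers share one load.
      const pending = llm.loadLoraModel(url).catch((err) => {
        cache.delete(url); // allow a retry after a failed load
        throw err;
      });
      cache.set(url, pending);
    }
    return cache.get(url);
  };
}
```

With `const getLoraModel = createLoraCache(llmInference)`, the reference from `await getLoraModel(loraModelRank4Url)` can be passed to `generateResponse()` in the same position as `loraModelRank4` above.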