The LLM Inference API lets you run large language models (LLMs) completely on-device for iOS applications, which you can use to perform a wide range of tasks, such as generating text, retrieving information in natural language form, and summarizing documents. The task provides built-in support for multiple text-to-text large language models, so you can apply the latest on-device generative AI models to your iOS apps.
The task supports the following variants of Gemma: Gemma-2 2B, Gemma 2B, and Gemma 7B. Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. It also supports the following external models: Phi-2, Falcon-RW-1B and StableLM-3B.
In addition to the supported models, you can use Google's AI Edge Torch to export PyTorch models into multi-signature LiteRT (`tflite`) models, which are bundled with tokenizer parameters to create Task Bundles that are compatible with the LLM Inference API.
You can see this task in action with the MediaPipe Studio demo. For more information about the capabilities, models, and configuration options of this task, see the Overview.
Code example
The MediaPipe Tasks example code is a basic implementation of an LLM Inference API app for iOS. You can use the app as a starting point for your own iOS app, or refer to it when modifying an existing app. The LLM Inference API example code is hosted on GitHub.
Download the code
The following instructions show you how to create a local copy of the example code using the git command line tool.
To download the example code:
Clone the git repository using the following command:
git clone https://github.com/google-ai-edge/mediapipe-samples
Optionally, configure your git instance to use sparse checkout, so you have only the files for the LLM Inference API example app:
cd mediapipe-samples
git sparse-checkout init --cone
git sparse-checkout set examples/llm_inference/ios/
After creating a local version of the example code, you can install the MediaPipe task library, open the project using Xcode and run the app. For instructions, see the Setup Guide for iOS.
Setup
This section describes key steps for setting up your development environment and code projects to use LLM Inference API. For general information on setting up your development environment for using MediaPipe tasks, including platform version requirements, see the Setup guide for iOS.
Dependencies
LLM Inference API uses the `MediaPipeTasksGenai` library, which must be installed using CocoaPods. The library is compatible with both Swift and Objective-C apps and does not require any additional language-specific setup.

For instructions to install CocoaPods on macOS, refer to the CocoaPods installation guide. For instructions on how to create a `Podfile` with the necessary pods for your app, refer to Using CocoaPods.
Add the `MediaPipeTasksGenai` pod in the `Podfile` using the following code:
target 'MyLlmInferenceApp' do
use_frameworks!
pod 'MediaPipeTasksGenAI'
pod 'MediaPipeTasksGenAIC'
end
If your app includes unit test targets, refer to the Setup Guide for iOS for additional information on setting up your `Podfile`.
Model
The MediaPipe LLM Inference API task requires a trained model that is compatible with this task. For more information on available trained models for LLM Inference API, see the task overview Models section.
Download a model
Download a model and add it to your project directory using Xcode. For instructions on how to add files to your Xcode project, refer to Managing files and folders in your Xcode project.
Before initializing the LLM Inference API, download one of the supported models and store the file within your project directory:
- Gemma-2 2B: The latest version of the Gemma family of models. Part of a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.
- Gemma 2B: Part of a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning.
- Phi-2: 2.7 billion parameter Transformer model, best suited for question-answer, chat, and code formats.
- Falcon-RW-1B: 1 billion parameter causal decoder-only model trained on 350B tokens of RefinedWeb.
- StableLM-3B: 3 billion parameter decoder-only language model pre-trained on 1 trillion tokens of diverse English and code datasets.
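After adding the model file in Xcode, you can optionally confirm at runtime that it was bundled with the app before initializing the task. The following is a minimal sketch, not part of the MediaPipe API; the resource name "model" and extension "bin" are assumptions and should match the model file you downloaded:

import Foundation

// Sanity check (not part of the MediaPipe API): confirm the downloaded
// model was bundled with the app. The resource name "model" and the
// extension "bin" are assumptions; match them to the file you added.
if let modelPath = Bundle.main.path(forResource: "model", ofType: "bin") {
    print("Model bundled at: \(modelPath)")
} else {
    print("Model not found in bundle; check the file's target membership in Xcode.")
}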
In addition to the supported models, you can use Google's AI Edge Torch to export PyTorch models into multi-signature LiteRT (`tflite`) models. For more information, see Torch Generative converter for PyTorch models.
We recommend using Gemma-2 2B, which is available on Kaggle Models. For more information on the other available models, see the task overview Models section.
Convert model to MediaPipe format
The LLM Inference API is compatible with two categories of models, some of which require model conversion. Use the following table to identify the conversion method required for your model.
| Models | Conversion method | Compatible platforms | File type |
|---|---|---|---|
| Supported models (Gemma 2B, Gemma 7B, Gemma-2 2B, Phi-2, StableLM, Falcon) | MediaPipe | Android, iOS, web | .bin |
| Other PyTorch models (all PyTorch LLM models) | AI Edge Torch Generative library | Android, iOS | .task |
We are hosting the converted `.bin` files for Gemma 2B, Gemma 7B, and Gemma-2 2B on Kaggle. These models can be directly deployed using our LLM Inference API. To learn how you can convert other models, see the Model Conversion section.
Create the task
You can create the LLM Inference API task by calling one of its initializers. The `LlmInference(options:)` initializer sets values for the configuration options.

If you don't need an LLM Inference API initialized with customized configuration options, you can use the `LlmInference(modelPath:)` initializer to create an LLM Inference API with the default options. For more information about configuration options, see Configuration Overview.
The following code demonstrates how to build and configure this task.
import MediaPipeTasksGenai
let modelPath = Bundle.main.path(forResource: "model",
ofType: "bin")
let options = LlmInferenceOptions()
options.baseOptions.modelPath = modelPath
options.maxTokens = 1000
options.topk = 40
options.temperature = 0.8
options.randomSeed = 101
let llmInference = try LlmInference(options: options)
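If the default configuration options are sufficient, the task can also be created from the model path alone. A minimal sketch, assuming the model file is bundled as model.bin:

import MediaPipeTasksGenai

// Create the task with default options; only the model path is required.
// The file name "model.bin" is an assumption.
guard let modelPath = Bundle.main.path(forResource: "model", ofType: "bin") else {
    fatalError("Model file not found in the app bundle.")
}
let llmInference = try LlmInference(modelPath: modelPath)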
Configuration options
This task has the following configuration options for iOS apps:
| Option Name | Description | Value Range | Default Value |
|---|---|---|---|
| `modelPath` | The path to where the model is stored within the project directory. | PATH | N/A |
| `maxTokens` | The maximum number of tokens (input tokens + output tokens) the model handles. | Integer | 512 |
| `topk` | The number of tokens the model considers at each step of generation. Limits predictions to the top k most-probable tokens. | Integer | 40 |
| `temperature` | The amount of randomness introduced during generation. A higher temperature results in more creativity in the generated text, while a lower temperature produces more predictable generation. | Float | 0.8 |
| `randomSeed` | The random seed used during text generation. | Integer | 0 |
| `loraPath` | The absolute path to the LoRA model locally on the device. Note: this is only compatible with GPU models. | PATH | N/A |
Prepare data
LLM Inference API works with text data. The task handles the data input preprocessing, including tokenization and tensor preprocessing.
All preprocessing is handled within the `generateResponse(inputText:)` function. There is no need for additional preprocessing of the input text beforehand.
let inputPrompt = "Compose an email to remind Brett of lunch plans at noon on Saturday."
Run the task
To run the LLM Inference API, use the `generateResponse(inputText:)` method. The LLM Inference API returns the generated response for the input text.

let result = try llmInference.generateResponse(inputText: inputPrompt)
To stream the response, use the `generateResponseAsync(inputText:)` method.

let resultStream = llmInference.generateResponseAsync(inputText: inputPrompt)
do {
for try await partialResult in resultStream {
print("\(partialResult)")
}
print("Done")
}
catch {
print("Response error: '\(error)")
}
Handle and display results
The LLM Inference API returns the generated response text.
Here's a draft you can use:
Subject: Lunch on Saturday Reminder
Hi Brett,
Just a quick reminder about our lunch plans this Saturday at noon.
Let me know if that still works for you.
Looking forward to it!
Best,
[Your Name]
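In an app, you would typically surface the generated or streamed text in your UI. The following SwiftUI sketch is one possible approach and is not part of the MediaPipe API; the view name, state property, and prompt are hypothetical:

import SwiftUI
import MediaPipeTasksGenai

// Hypothetical view that appends streamed partial results to a text view.
// `llmInference` is assumed to be a configured LlmInference instance.
struct ResponseView: View {
    let llmInference: LlmInference
    @State private var responseText = ""

    var body: some View {
        ScrollView {
            Text(responseText)
        }
        .task {
            do {
                let resultStream = llmInference.generateResponseAsync(
                    inputText: "Compose an email to remind Brett of lunch plans.")
                // Append each partial result to the displayed text as it arrives.
                for try await partialResult in resultStream {
                    responseText += partialResult
                }
            } catch {
                responseText = "Response error: \(error)"
            }
        }
    }
}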
LoRA model customization
The MediaPipe LLM Inference API can be configured to support Low-Rank Adaptation (LoRA) for large language models. Using fine-tuned LoRA models, developers can customize the behavior of LLMs through a cost-effective training process.

LoRA support of the LLM Inference API works for all Gemma variants and Phi-2 models for the GPU backend, with LoRA weights applicable to attention layers only. This initial implementation serves as an experimental API for future developments, with plans to support more models and various types of layers in coming updates.
Prepare LoRA models
Follow the instructions on HuggingFace to train a fine-tuned LoRA model on your own dataset with a supported model type, Gemma or Phi-2. The Gemma-2 2B, Gemma 2B, and Phi-2 models are all available on HuggingFace in the safetensors format. Since the LLM Inference API only supports LoRA on attention layers, specify only attention layers when creating the `LoraConfig`, as follows:
# For Gemma
from peft import LoraConfig
config = LoraConfig(
r=LORA_RANK,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)
# For Phi-2
config = LoraConfig(
r=LORA_RANK,
target_modules=["q_proj", "v_proj", "k_proj", "dense"],
)
For testing, publicly accessible fine-tuned LoRA models compatible with the LLM Inference API are available on HuggingFace. For example, monsterapi/gemma-2b-lora-maths-orca-200k for Gemma-2B and lole25/phi-2-sft-ultrachat-lora for Phi-2.
After training on the prepared dataset and saving the model, you obtain an `adapter_model.safetensors` file containing the fine-tuned LoRA model weights. The safetensors file is the LoRA checkpoint used in the model conversion.
As the next step, you need to convert the model weights into a TensorFlow Lite Flatbuffer using the MediaPipe Python Package. The `ConversionConfig` should specify the base model options as well as additional LoRA options. Note that since the API only supports LoRA inference with GPU, the backend must be set to `'gpu'`.
import mediapipe as mp
from mediapipe.tasks.python.genai import converter
config = converter.ConversionConfig(
# Other params related to base model
...
# Must use gpu backend for LoRA conversion
backend='gpu',
# LoRA related params
lora_ckpt=LORA_CKPT,
lora_rank=LORA_RANK,
lora_output_tflite_file=LORA_OUTPUT_TFLITE_FILE,
)
converter.convert_checkpoint(config)
The converter will output two TFLite flatbuffer files, one for the base model and the other for the LoRA model.
LoRA model inference
The web, Android, and iOS LLM Inference APIs are updated to support LoRA model inference.

iOS supports static LoRA during initialization. To load a LoRA model, specify the LoRA model path as well as the base LLM:

import MediaPipeTasksGenai
let modelPath = Bundle.main.path(forResource: "model",
ofType: "bin")
let loraPath = Bundle.main.path(forResource: "lora_model",
                                ofType: "bin")
let options = LlmInferenceOptions()
options.baseOptions.modelPath = modelPath
options.maxTokens = 1000
options.topk = 40
options.temperature = 0.8
options.randomSeed = 101
options.loraPath = loraPath
let llmInference = try LlmInference(options: options)
To run LLM inference with LoRA, use the same `generateResponse()` or `generateResponseAsync()` methods as the base model.