FunctionGemma released, a model tuned for function calling! Learn more

Run Gemma content generation and inferences

There are two key decisions to make when you want to run a Gemma model: 1) what Gemma variant you want to run, and 2) what AI execution framework you are going to use to run it? A key issue in making both these decisions has to do with what are hardware you and your users have available to run the model.

This overview helps you navigate these decisions and start working with Gemma models. The general steps for running a Gemma model are as follows:

Choose a framework for running
Select a Gemma variant
Run generation and inference requests

Choose a framework

Gemma models are compatible with a variety of generative AI execution frameworks. One of the key decision making factors in running a Gemma model is what computing resources you have (or will have) available to you to run the model. Most compatible AI frameworks require specialized hardware, such as GPUs or TPUs, to run a Gemma model effectively. Tools such as Google Colab can provide these specialized compute resources on a limited basis. Some AI execution frameworks, such as Ollama and Gemma.cpp, allow you to run Gemma on more common CPUs using x86-compatible or ARM architectures.

Here are guides for running Gemma models with various AI runtime frameworks:

Make sure your intended deployment Gemma model format, such as Keras native format, Safetensors, or GGUF, is supported by your chosen framework.

Select a Gemma variant

Gemma models are available in several variants and sizes, including the foundation or core Gemma models, and more specialized model variants such as PaliGemma and DataGemma, and many variants created by the AI developer community on sites such as Kaggle and Hugging Face. If you are unsure about what variant you should start with, select the latest Gemma core instruction-tuned (IT) model with the lowest number of parameters. This type of Gemma model has low compute requirements and be able to respond to a wide variety of prompts without requiring additional development.

Consider the following factors when choosing a Gemma variant:

Gemma core, and other variant families such as PaliGemma, CodeGemma: Recommend Gemma (core). Gemma variants beyond the core version have the same architecture as the core model, and are trained to perform better at specific tasks. Unless your application or goals align with the specialization of a specific Gemma variant, it is best to start with a Gemma core, or base, model.
Instruction-tuned (IT), pre-trained (PT), fine-tuned (FT), mixed (mix): Recommend IT.
- Instruction-tuned (IT) Gemma variants are models that have been trained to respond to a variety of instructions or requests in human language. These model variants are the best place to start because they can respond to prompts without further model training.
- Pre-trained (PT) Gemma variants are models that have been trained to make inferences about language or other data, but have not been trained to follow human instructions. These models require additional training or tuning to be able to perform tasks effectively, and are meant for researchers or developers who want to study or develop the capabilities of the model and its architecture.
- Fine-tuned (FT) Gemma variants can be considered IT variants, but are typically trained to perform a specific task, or perform well on a specific generative AI benchmark. The PaliGemma variant family includes a number of FT variants.
- Mixed (mix) Gemma variants are versions of PaliGemma models that have been instruction tuned with a variety of instructions and are suitable for general use.
Parameters: Recommend smallest number available. In general, the more parameters a model has, the more capable it is. However, running larger models requires larger and more complex compute resources, and generally slows down development of an AI application. Unless you have already determined that a smaller Gemma model cannot meet your needs, choose a one with a small number of parameters.
Quantization levels: Recommend half precision (16-bit), except for tuning. Quantization is a complex topic that boils down to what size and precision of data, and consequently how much memory a generative AI model uses for calculations and generating responses. After a model is trained with high-precision data, which is typically 32-bit floating point data, models like Gemma can be modified to use lower precision data such as 16, 8 or 4-bit sizes. These quantized Gemma models can still perform well, depending on the complexity of the tasks, while using significantly less compute and memory resources. However, tools for tuning quantized models are limited and may not be available within your chosen AI development framework. Typically, you must fine-tune a model like Gemma at full precision, then quantize the resulting model.

For a list of key, Google-published Gemma models, see the Getting started with Gemma models, Gemma model list.

Run generation and inference requests

After you have selected an AI execution framework and a Gemma variant, you can start running the model, and prompting it to generate content or complete tasks. For more information on how to run Gemma with a specific framework, see the guides linked in the Choose a framework section.

Prompt formatting

All instruction-tuned Gemma variants have specific prompt formatting requirements. Some of these formatting requirements are handled automatically by the framework you use to run Gemma models, but when you are sending prompt data directly to a tokenizer, you must add specific tags, and the tagging requirements can change depending on the Gemma variant you are using. See the following guides for information on Gemma variant prompt formatting and system instructions: