The gemma.cpp engine is a lightweight, pure C++ inference runtime for the Gemma model. For additional information about Gemma, see the Gemma page. Model weights, including gemma.cpp-specific artifacts, are available on Kaggle.
Who is this project for?
Modern LLM inference engines are sophisticated systems, often with bespoke capabilities extending beyond traditional neural network runtimes. With this come opportunities for research and innovation through co-design of high-level algorithms and low-level computation. However, there is a gap between deployment-oriented C++ inference runtimes, which are not designed for experimentation, and Python-centric ML research frameworks, which abstract away low-level computation through compilation.
The gemma.cpp engine provides a minimalist implementation of models across Gemma releases, focusing on simplicity and directness rather than full generality. This is inspired by vertically-integrated C++ model implementations such as ggml, llama.c, and llama.rs.
This library is meant for experimentation and research use cases, exploring the design space of CPU inference and inference algorithms using portable SIMD via the Google Highway library. It is intended to be straightforward to embed in other projects with minimal dependencies and also easily modifiable with a small ~2K LoC core implementation (along with ~4K LoC of supporting utilities).
For production-oriented edge deployments, we recommend standard deployment pathways using mature Python frameworks like JAX, Keras, PyTorch, and Transformers.
Community contributions large and small are welcome. This project follows Google's Open Source Community Guidelines.
Quickstart
To complete this quickstart, you must clone or download gemma.cpp.
System requirements
Before starting, you should have installed:
- CMake
- Clang C++ Compiler
- `tar` for extracting archives from Kaggle
If you are building on Windows, you may want to install the Build Tools for Microsoft Visual Studio as well.
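If you use the Visual Studio Build Tools, one possible configure step is to select the Visual Studio generator with the ClangCL toolset; the generator string below assumes the 2022 Build Tools and is only an illustration:

```sh
cmake -B build -G "Visual Studio 17 2022" -T ClangCL
```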
Step 1: Identify the model type that you want to use
Gemma.cpp requires that you specify the model type being used, which is a string that identifies:

- the model generation,
- the model size, and
- whether the model is instruction-tuned or pretrained.
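For example, `gemma3-4b-it` identifies a Gemma 3 model with 4B parameters that is instruction-tuned.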
Valid model types are as follows:
Gemma 3
Model type | Description |
---|---|
`gemma3-1b-it` | Gemma 3 1B parameters, instruction-tuned |
`gemma3-1b-pt` | Gemma 3 1B parameters, pretrained |
`gemma3-4b-it` | Gemma 3 4B parameters, instruction-tuned |
`gemma3-4b-pt` | Gemma 3 4B parameters, pretrained |
`gemma3-12b-it` | Gemma 3 12B parameters, instruction-tuned |
`gemma3-12b-pt` | Gemma 3 12B parameters, pretrained |
`gemma3-27b-it` | Gemma 3 27B parameters, instruction-tuned |
`gemma3-27b-pt` | Gemma 3 27B parameters, pretrained |
Gemma 2
Model type | Description |
---|---|
`gemma2-2b-it` | Gemma 2 2B parameters, instruction-tuned |
`gemma2-2b-pt` | Gemma 2 2B parameters, pretrained |
`gemma2-9b-it` | Gemma 2 9B parameters, instruction-tuned |
`gemma2-9b-pt` | Gemma 2 9B parameters, pretrained |
`gemma2-27b-it` | Gemma 2 27B parameters, instruction-tuned |
`gemma2-27b-pt` | Gemma 2 27B parameters, pretrained |
Gemma and Gemma 1.1
Model type | Description |
---|---|
`gemma-2b-it` | Gemma 2B parameters, instruction-tuned |
`gemma-2b-pt` | Gemma 2B parameters, pretrained |
`gemma-7b-it` | Gemma 7B parameters, instruction-tuned |
`gemma-7b-pt` | Gemma 7B parameters, pretrained |
Step 2: Obtain model weights and tokenizer from Kaggle
After choosing the model and size you want to use, you need to obtain the weights.
Visit the Kaggle model page for the model you want to use, then select **Model Variations > Gemma C++**. On this tab, the **Variation** drop-down includes models formatted for use with gemma.cpp in a variety of sizes and formats.
For example, Gemma 3 has the following models available:
Model name | Description |
---|---|
`1b-it-sfp` | 1 billion parameter instruction-tuned model, 8-bit switched floating point |
`1b-pt-sfp` | 1 billion parameter pre-trained model, 8-bit switched floating point |
`4b-it-sfp` | 4 billion parameter instruction-tuned model, 8-bit switched floating point |
`4b-pt-sfp` | 4 billion parameter pre-trained model, 8-bit switched floating point |
`12b-it-sfp` | 12 billion parameter instruction-tuned model, 8-bit switched floating point |
`12b-pt-sfp` | 12 billion parameter pre-trained model, 8-bit switched floating point |
`27b-it-sfp` | 27 billion parameter instruction-tuned model, 8-bit switched floating point |
`27b-pt-sfp` | 27 billion parameter pre-trained model, 8-bit switched floating point |
NOTE: We recommend starting with `1b-it-sfp` to get up and running.
Step 3: Extract Files
After filling out the consent form, the download should proceed to retrieve a tar archive file, `archive.tar.gz`. Extract the files from `archive.tar.gz` (this can take a few minutes):

```sh
tar -xf archive.tar.gz
```
This should produce a file containing model weights, such as `1b-it-sfp.sbs`, and a tokenizer file (`tokenizer.spm`). You may want to move these files to a convenient directory location (e.g. the `build/` directory in this repo).
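For example, assuming you extracted the archive in your current directory and cloned this repo to `~/gemma.cpp` (paths are illustrative), you could move the files with:

```sh
mkdir -p ~/gemma.cpp/build
mv 1b-it-sfp.sbs tokenizer.spm ~/gemma.cpp/build/
```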
Step 4: Build
The build system uses CMake. This section describes how to build the Gemma inference runtime using `cmake`.
To build the gemma inference runtime:
- Create a `build/` directory. From your top-level project directory, generate the build files:

  ```sh
  (cd build && cmake ..)
  ```

- Run `make` to build the `./gemma` executable, specifying the number of parallel threads to use with the `-j` parameter:

  ```sh
  cd build
  make -j 8 gemma
  ```

If this is successful, you should now have a `gemma` executable in your `build/` directory.
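If you prefer not to change directories, the same build can be expressed with CMake's `-B` and `--build` options (assuming CMake 3.13 or newer):

```sh
cmake -B build                           # generate build files in build/
cmake --build build -j 8 --target gemma  # compile the gemma executable
```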
Step 5: Run
You can now run `gemma` from inside the `build/` directory. `gemma` has the following required arguments:
Argument | Description | Example value |
---|---|---|
`--model` | The model type. | `gemma3-1b-it`, ... (see above) |
`--compressed_weights` | The compressed weights file. | `1b-it-sfp.sbs`, ... (see above) |
`--tokenizer` | The tokenizer filename. | `tokenizer.spm` |
`gemma` is invoked as:

```sh
./gemma \
  --tokenizer [tokenizer file] \
  --compressed_weights [compressed weights file] \
  --model [model type]
```
Example invocation for the following configuration:

- Compressed weights file `1b-it-sfp.sbs` (1B instruction-tuned model, 8-bit switched floating point)
- Tokenizer file `tokenizer.spm`
```sh
./gemma \
  --tokenizer tokenizer.spm \
  --compressed_weights 1b-it-sfp.sbs \
  --model gemma3-1b-it
```
Usage
`gemma` has different usage modes, controlled by the verbosity flag.
All usage modes are interactive, triggering text generation upon newline input.
Verbosity | Usage mode | Details |
---|---|---|
`--verbosity 0` | Minimal | Only prints generation output. Suitable as a CLI tool. |
`--verbosity 1` | Default | Standard user-facing terminal UI. |
`--verbosity 2` | Detailed | Shows additional developer and debug info. |
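For example, with `--verbosity 0` you can pipe a prompt into `gemma` and capture only the generated text; the file names below assume the 1B instruction-tuned download from Step 2:

```sh
echo "Write a haiku about portable SIMD." | \
  ./gemma --tokenizer tokenizer.spm \
    --compressed_weights 1b-it-sfp.sbs \
    --model gemma3-1b-it --verbosity 0
```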
Interactive Terminal App
By default, verbosity is set to 1, bringing up a terminal-based interactive interface when `gemma` is invoked with the required arguments:
```
$ ./gemma [...]
  __ _  ___ _ __ ___  _ __ ___   __ _   ___ _ __  _ __
 / _` |/ _ \ '_ ` _ \| '_ ` _ \ / _` | / __| '_ \| '_ \
| (_| |  __/ | | | | | | | | | | (_| || (__| |_) | |_) |
 \__, |\___|_| |_| |_|_| |_| |_|\__,_(_)___| .__/| .__/
  __/ |                                    | |   | |
 |___/                                     |_|   |_|

tokenizer             : tokenizer.spm
compressed_weights    : 1b-it-sfp.sbs
model                 : gemma3-1b-it
weights               : [no path specified]
max_tokens            : 3072
max_generated_tokens  : 2048

*Usage*
  Enter an instruction and press enter (%Q quits).

*Examples*
  - Write an email to grandma thanking her for the cookies.
  - What are some historical attractions to visit around Massachusetts?
  - Compute the nth fibonacci number in javascript.
  - Write a standup comedy bit about WebGPU programming.

> What are some outdoorsy places to visit around Boston?

[ Reading prompt ] .....................

**Boston Harbor and Islands:**

* **Boston Harbor Islands National and State Park:** Explore pristine beaches, wildlife, and maritime history.
* **Charles River Esplanade:** Enjoy scenic views of the harbor and city skyline.
* **Boston Harbor Cruise Company:** Take a relaxing harbor cruise and admire the city from a different perspective.
* **Seaport Village:** Visit a charming waterfront area with shops, restaurants, and a seaport museum.

**Forest and Nature:**

* **Forest Park:** Hike through a scenic forest with diverse wildlife.
* **Quabbin Reservoir:** Enjoy boating, fishing, and hiking in a scenic setting.
* **Mount Forest:** Explore a mountain with breathtaking views of the city and surrounding landscape.
...
```
Use images with the interactive terminal
Gemma 3 and later models in 4B parameter sizes and higher support image input as part of a prompt. You can include an image as part of your prompt with the `--image_file` flag, as follows:
```
> What breed of cat is this? --image_file ./images/cat001.ppm
```
The Gemma.cpp engine supports the Portable Pixmap (PPM) image format. Additional image format support will be available in upcoming releases.
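If your source image is in another format, a third-party tool such as ImageMagick (assuming you have it installed) can convert it to PPM:

```sh
magick cat001.jpg cat001.ppm
```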
Usage as a Command Line Tool
To use the `gemma` executable as a command line tool, it may be useful to create an alias for gemma.cpp with the arguments fully specified:

```sh
alias gemma2b="~/gemma.cpp/build/gemma -- --tokenizer \
  ~/gemma.cpp/build/tokenizer.spm --compressed_weights \
  ~/gemma.cpp/build/2b-it-sfp.sbs --model gemma2-2b-it --verbosity 0"
```
Replace the above paths with your own paths to the model weights and tokenizer from the download.
Here is an example of prompting `gemma` with a truncated input (using a `gemma2b` alias like the one defined above):

```sh
cat configs.h | tail -35 | tr '\n' ' ' | xargs -0 echo "What does this C++ code do: " | gemma2b
```
NOTE: CLI usage of gemma.cpp is experimental and should take context length limitations into account.
The output of the preceding command should look like:
```
$ cat configs.h | tail -35 | tr '\n' ' ' | xargs -0 echo "What does this C++ code do: " | gemma2b
[ Reading prompt ] ........................................................................................
The code defines two C++ structs, `ConfigGemma7B` and `ConfigGemma2B`, which are used for configuring a deep learning model.

**ConfigGemma7B**:

* `seq_len`: Stores the length of the sequence to be processed. It's set to 7168.
* `vocab_size`: Stores the size of the vocabulary, which is 256128.
* `n_layers`: Number of layers in the deep learning model. It's set to 28.
* `dim_model`: Dimension of the model's internal representation. It's set to 3072.
* `dim_ffw_hidden`: Dimension of the feedforward and recurrent layers' hidden representations. It's set to 16 * 3072 / 2.

**ConfigGemma2B**:

* `seq_len`: Stores the length of the sequence to be processed. It's also set to 7168.
* `vocab_size`: Size of the vocabulary, which is 256128.
* `n_layers`: Number of layers in the deep learning model. It's set to 18.
* `dim_model`: Dimension of the model's internal representation. It's set to 2048.
* `dim_ffw_hidden`: Dimension of the feedforward and recurrent layers' hidden representations. It's set to 16 * 2048 / 2.

These structs are used to configure a deep learning model with specific parameters for either Gemma7B or Gemma2B architecture.
```
Use images on the command line
Gemma 3 and later models in 4B parameter sizes and higher support image input as part of a prompt. You can include an image as part of your command line prompt with the `--image_file` flag, as follows:
```
$ xargs -0 echo "What breed of cat is this?: --image_file ./images/cat001.ppm" | gemma2b
```
The Gemma.cpp engine supports the Portable Pixmap (PPM) image format. Additional image format support will be available in upcoming releases.
Usage as a Shared Library
Gemma.cpp can be built as a shared library (`.so` on Linux, `.dylib` on Mac, and `.dll` on Windows) for use with your applications. We provide a C API and C# bindings to make that integration easier.
To build Gemma.cpp as a shared library:
- Create a `build/` directory and generate the build files using `cmake` from the top-level project directory:

  ```sh
  (cd build && cmake ..)
  ```
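You can then build the library target with `make`. The target name below, `libgemma`, is an assumption; check the project's CMakeLists.txt for the actual library target:

```sh
cd build
make -j 8 libgemma
```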