Run Gemma with Ollama

Running generative artificial intelligence (AI) models like Gemma can be challenging without the right hardware. Open source frameworks such as llama.cpp and Ollama make this easier by setting up a pre-configured runtime environment that lets you run versions of Gemma with fewer compute resources. In fact, with llama.cpp or Ollama you can run versions of Gemma on a laptop or other small computing device without a graphics processing unit (GPU).

To run Gemma models with fewer compute resources, the llama.cpp and Ollama frameworks use quantized versions of the models in the GPT-Generated Unified Format (GGUF) file format. These quantized models are modified to process requests using smaller, less precise data. Using less precise data typically lowers the quality of the model's output, but it also lowers the compute resource cost.

This guide describes how to set up and use Ollama to run Gemma to generate text responses.

Setup

This section describes how to set up Ollama and prepare a Gemma model instance to respond to requests, including requesting model access, installing software, and configuring a Gemma model in Ollama.

Get access to Gemma models

Before working with Gemma models, make sure you have requested access via Kaggle and reviewed the Gemma terms of use.

Install Ollama

Before you can use Gemma with Ollama, you must download and install the Ollama software on your computing device.

To download and install Ollama:

  1. Navigate to the download page: https://ollama.com/download
  2. Select your operating system, then click the Download button or follow the instructions on the download page.
  3. Install the application by running the installer.
    • Windows: Run the installer *.exe file and follow the instructions.
    • Mac: Unpack the zip package and move the Ollama application folder to your Applications directory.
    • Linux: Follow the instructions for the bash script installer.
  4. Confirm that Ollama is installed by opening a terminal window and entering the following command:

    ollama --version
    

You should see a response similar to: ollama version is #.#.##. If you do not get this result, make sure that the Ollama executable is added to your operating system path.
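
On Linux or macOS, one way to check whether the executable is discoverable is the which command. The install directory shown below is only a typical example and may differ on your system:

    # Print the location of the ollama binary, if it is found on your PATH
    which ollama

    # If nothing is printed, add the Ollama install directory to your PATH, for example:
    export PATH="$PATH:/usr/local/bin"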

Configure Gemma in Ollama

The Ollama installation package does not include any models by default. You download a model using the pull command.

To configure Gemma in Ollama:

  1. Download and configure the default Gemma 2 variant by opening a terminal window and entering the following command:

    ollama pull gemma2
    
  2. After the download completes, you can confirm the model is available with the following command (illustrative output is shown after this list):

    ollama list
    

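The list command prints a table of the models you have downloaded. The output below is purely illustrative; names, IDs, sizes, and timestamps will differ on your system:

    NAME             ID        SIZE      MODIFIED
    gemma2:latest    ...       5.4 GB    2 minutes ago
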
By default, Ollama downloads the 9 billion parameter, 4-bit quantized (Q4_0) Gemma model variant. You can also download and use other sizes of the Gemma model by specifying a parameter size.

Models are specified as <model_name>:<tag>. For the Gemma 2 2 billion parameter model, enter gemma2:2b. For the 27 billion parameter model, enter gemma2:27b. You can find the available tags for Gemma 2 and Gemma on the Ollama website.
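
For example, to download a specific size instead of the default, you can pull a tagged variant directly. The commands below use the 2 billion and 27 billion parameter tags mentioned above; check the Ollama model library for the current list of tags:

    # Pull the 2 billion parameter Gemma 2 variant
    ollama pull gemma2:2b

    # Pull the 27 billion parameter Gemma 2 variant
    ollama pull gemma2:27b

You can then run a specific variant by using the same tag, for example ollama run gemma2:2b.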

Generate responses

After you install a Gemma model in Ollama, you can generate responses immediately using the run command in Ollama's command line interface. Ollama also runs a local web service for accessing the model, which you can test using the curl command.

To generate a response from the command line:

  • In a terminal window, enter the following command:

    ollama run gemma2 "roses are red"
    
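If you omit the prompt text, the run command starts an interactive chat session, where you can enter prompts one after another and exit by typing /bye:

    ollama run gemma2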

To generate a response using the Ollama local web service:

  • In a terminal window, enter the following command:

    curl http://localhost:11434/api/generate -d '{
      "model": "gemma2",
      "prompt": "roses are red"
    }'
    
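By default, the generate endpoint streams the response back as a series of JSON objects. To receive the complete response as a single JSON object instead, disable streaming in the request body:

    curl http://localhost:11434/api/generate -d '{
      "model": "gemma2",
      "prompt": "roses are red",
      "stream": false
    }'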

Tuned Gemma models

Ollama provides a set of official Gemma model variants for immediate use, which are quantized and saved in GGUF format. You can also use your own tuned Gemma models with Ollama by converting them to GGUF format and importing them with a Modelfile. For more information on converting and importing tuned models, see the Ollama README.
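
As a rough sketch, once a tuned model has been converted to a GGUF file, you can register it with Ollama through a Modelfile. The file and model names below are hypothetical; substitute the path to your own converted model:

    # Modelfile
    # FROM points to the converted GGUF file (hypothetical file name)
    FROM ./my-tuned-gemma.gguf

Then create and run the model:

    ollama create my-tuned-gemma -f Modelfile
    ollama run my-tuned-gemma "roses are red"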

Next steps

Once you have Gemma running with Ollama, you can start experimenting and building solutions with Gemma's generative AI capabilities. Ollama's command line interface can be useful for building scripting solutions. The Ollama local web service interface can be useful for building experimental and low-volume applications.
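
As a minimal scripting sketch, assuming you have already pulled the gemma2 model, you can wrap the run command in a small shell script that takes the prompt as an argument:

    #!/bin/bash
    # gemma_prompt.sh (hypothetical helper script)
    # Usage: ./gemma_prompt.sh "your prompt here"
    # Passes the first argument as a prompt to the gemma2 model and prints the response.
    ollama run gemma2 "$1"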