Google Cloud provides many services for deploying and serving Gemma open models, including the following:
Vertex AI
Vertex AI is a Google Cloud platform for rapidly building and scaling machine learning projects without requiring in-house MLOps expertise. Vertex AI provides a console where you can work with a large selection of models and offers end-to-end MLOps capabilities and a serverless experience for streamlined development.
You can use Vertex AI as the downstream application that serves Gemma, which is available in Model Garden, a curated collection of models. For example, you could port weights from a Gemma implementation and use Vertex AI to serve that version of Gemma to get predictions.
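As a minimal sketch of that workflow, the following Python snippet sends a prediction request to a Gemma model that has already been deployed from Model Garden to a Vertex AI endpoint. The project ID, region, endpoint ID, and instance schema are placeholders, not values from this page; adjust them to match your own deployment.

```python
# Minimal sketch: query a Gemma model already deployed from Model Garden
# to a Vertex AI endpoint. The project, region, endpoint ID, and the
# instance schema below are assumptions -- match them to your deployment.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# The numeric endpoint ID is assigned when you deploy the model.
endpoint = aiplatform.Endpoint(
    "projects/your-project-id/locations/us-central1/endpoints/1234567890"
)

response = endpoint.predict(
    instances=[{"prompt": "Why is the sky blue?", "max_tokens": 128}]
)
print(response.predictions[0])
```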
To learn more, refer to the following pages:
- Introduction to Vertex AI: Get started with Vertex AI.
- Gemma with Vertex AI: Use Gemma open models with Vertex AI.
- Fine-tune Gemma using KerasNLP and deploy to Vertex AI: End-to-end notebook to fine-tune Gemma using Keras.
Google Kubernetes Engine (GKE)
Google Kubernetes Engine (GKE) is a managed Kubernetes service from Google Cloud that you can use to deploy and operate containerized applications at scale using Google's infrastructure. You can serve Gemma using Cloud Tensor Processing Units (TPUs) and graphics processing units (GPUs) on GKE with these LLM serving frameworks:
- Serve Gemma using GPUs on GKE with vLLM
- Serve Gemma using GPUs on GKE with TGI
- Serve Gemma using GPUs on GKE with Triton and TensorRT-LLM
- Serve Gemma using TPUs on GKE with JetStream
- Serve Gemma using TPUs on GKE with Saxml
By serving Gemma on GKE, you can implement a robust, production-ready inference serving solution with all the benefits of managed Kubernetes, including efficient scalability and high availability.
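For instance, once a framework like vLLM is serving Gemma on a GPU node pool, clients can call its OpenAI-compatible REST API. The sketch below assumes a Kubernetes Service named `vllm-service` exposing the server on port 8000 and a deployment launched with the `google/gemma-2b-it` model; both names are placeholders.

```python
# Minimal sketch: call a Gemma model served by vLLM on GKE through the
# server's OpenAI-compatible REST API. The Service hostname and model name
# are assumptions -- they depend on your Kubernetes Service and deployment.
import requests

# In-cluster DNS name of the vLLM Service (use a port-forwarded
# localhost address instead when testing from outside the cluster).
VLLM_URL = "http://vllm-service:8000/v1/completions"

payload = {
    "model": "google/gemma-2b-it",  # model ID the server was launched with
    "prompt": "Explain Kubernetes in one sentence.",
    "max_tokens": 64,
    "temperature": 0.7,
}

response = requests.post(VLLM_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```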
To learn more, refer to the following pages:
- GKE overview: Get started with Google Kubernetes Engine (GKE).
- AI/ML orchestration on GKE: Run optimized AI/ML workloads with GKE.
Dataflow ML
Dataflow ML is a Google Cloud platform for deploying and managing complete machine learning workflows. With Dataflow ML, you can prepare your data for model training with data processing tools, then use models like Gemma to perform local and remote inference with batch and streaming pipelines.
You can use Dataflow ML to seamlessly integrate Gemma into your Apache Beam inference pipelines with a few lines of code, enabling you to ingest data, verify and transform the data, feed text inputs into Gemma, and generate text output.
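The following sketch shows that pattern with a custom Apache Beam model handler. The `GemmaModelHandler` class here is illustrative, not a Beam built-in, and the KerasNLP preset name and generation parameters are assumptions; the official tutorial linked below covers a complete, supported pipeline.

```python
# Minimal sketch: an Apache Beam pipeline that runs local inference with
# Gemma through a custom ModelHandler. The handler is hypothetical and
# the preset name and generation settings are assumptions.
import apache_beam as beam
from apache_beam.ml.inference.base import ModelHandler, RunInference


class GemmaModelHandler(ModelHandler):
    """Loads a Gemma model with KerasNLP and generates text per prompt."""

    def load_model(self):
        import keras_nlp
        # Downloads the preset on first use; requires Gemma access.
        return keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

    def run_inference(self, batch, model, inference_args=None):
        for prompt in batch:
            yield model.generate(prompt, max_length=64)


with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "Prompts" >> beam.Create(["Why is the sky blue?"])
        | "Gemma" >> RunInference(GemmaModelHandler())
        | "Print" >> beam.Map(print)
    )
```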
To learn more, refer to the following pages:
- Use Gemma open models with Dataflow: Get started with Gemma in Dataflow.
- Run inference with a Gemma open model: Tutorial that uses Gemma in an Apache Beam inference pipeline.