Google Kubernetes Engine (GKE) is a managed Kubernetes service from Google Cloud that you can use to deploy and operate containerized applications at scale on Google's infrastructure. You can serve Gemma using Cloud Tensor Processing Units (TPUs) and graphics processing units (GPUs) on GKE with these LLM serving frameworks:
- Serve Gemma using GPUs on GKE with vLLM
- Serve Gemma using GPUs on GKE with TGI
- Serve Gemma using GPUs on GKE with Triton and TensorRT-LLM
- Serve Gemma using TPUs on GKE with JetStream
- Serve Gemma using TPUs on GKE with Saxml
By serving Gemma on GKE, you can implement a robust, production-ready inference serving solution with all the benefits of managed Kubernetes, including efficient scalability and high availability.
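As a rough illustration of what such a deployment looks like, the following is a minimal sketch of a Kubernetes Deployment that serves a Gemma model on a GKE GPU node pool using vLLM's OpenAI-compatible server. The image tag, model name, accelerator type, and Secret name (`hf-secret` holding a Hugging Face token) are assumptions for illustration; the official GKE tutorials linked above provide complete, tested manifests.

```yaml
# Sketch only: a single-replica vLLM server for Gemma on a GKE GPU node.
# Assumes an L4 GPU node pool and a Secret named "hf-secret" with a
# Hugging Face access token that has been granted access to Gemma.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest   # example image; pin a version in practice
        args:
        - --model=google/gemma-2b        # example model ID
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret            # hypothetical Secret name
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 8000
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # example accelerator
```

A Service (for example, of type ClusterIP) in front of this Deployment would then expose the server's OpenAI-compatible HTTP API on port 8000 to other workloads in the cluster.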
To learn more, refer to the following pages:
- GKE overview: Get started with Google Kubernetes Engine (GKE)
- AI/ML orchestration on GKE: Run optimized AI/ML workloads with GKE