Google Cloud provides many services for deploying and serving Gemma open models, including the following:
Vertex AI
Vertex AI is a Google Cloud platform for rapidly building and scaling machine learning projects without requiring in-house MLOps expertise. Vertex AI provides a console where you can work with a large selection of models and offers end-to-end MLOps capabilities and a serverless experience for streamlined development.
You can use Vertex AI as the downstream application that serves Gemma, which is available in Model Garden, a curated collection of models. For example, you could port weights from a Gemma implementation and use Vertex AI to serve that version of Gemma to get predictions.
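As a minimal sketch of that workflow, the following Python snippet sends a prediction request to a Gemma model that has already been deployed from Model Garden to a Vertex AI endpoint. The project ID, region, endpoint ID, and instance schema are placeholders, not values from this page; adjust them to match your own deployment.

```python
# Minimal sketch: query a Gemma model already deployed from Model Garden
# to a Vertex AI endpoint. The project, region, endpoint ID, and the
# instance schema below are assumptions -- match them to your deployment.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# The numeric endpoint ID is assigned when you deploy the model.
endpoint = aiplatform.Endpoint(
    "projects/your-project-id/locations/us-central1/endpoints/1234567890"
)

response = endpoint.predict(
    instances=[{"prompt": "Why is the sky blue?", "max_tokens": 128}]
)
print(response.predictions[0])
```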
To learn more, refer to the following pages:
- Introduction to Vertex AI: Get started with Vertex AI.
- Gemma with Vertex AI: Use Gemma open models with Vertex AI.
- Fine-tune Gemma using KerasNLP and deploy to Vertex AI: End-to-end notebook to fine-tune Gemma using Keras.
Google Kubernetes Engine (GKE)
Google Kubernetes Engine (GKE) is a managed Kubernetes service from Google Cloud that you can use to deploy and operate containerized applications at scale using Google's infrastructure. You can serve Gemma using Cloud Tensor Processing Units (TPUs) and graphics processing units (GPUs) on GKE with these LLM serving frameworks:
- Serve Gemma using GPUs on GKE with vLLM
- Serve Gemma using GPUs on GKE with TGI
- Serve Gemma using GPUs on GKE with Triton and TensorRT-LLM
- Serve Gemma using TPUs on GKE with JetStream
- Serve Gemma using TPUs on GKE with Saxml
By serving Gemma on GKE, you can implement a robust, production-ready inference serving solution with all the benefits of managed Kubernetes, including efficient scalability and high availability.
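For instance, once a framework like vLLM is serving Gemma on a GPU node pool, clients can call its OpenAI-compatible REST API. The sketch below assumes a Kubernetes Service named `vllm-service` exposing the server on port 8000 and a deployment launched with the `google/gemma-2b-it` model; both names are placeholders.

```python
# Minimal sketch: call a Gemma model served by vLLM on GKE through the
# server's OpenAI-compatible REST API. The Service hostname and model name
# are assumptions -- they depend on your Kubernetes Service and deployment.
import requests

# In-cluster DNS name of the vLLM Service (use a port-forwarded
# localhost address instead when testing from outside the cluster).
VLLM_URL = "http://vllm-service:8000/v1/completions"

payload = {
    "model": "google/gemma-2b-it",  # model ID the server was launched with
    "prompt": "Explain Kubernetes in one sentence.",
    "max_tokens": 64,
    "temperature": 0.7,
}

response = requests.post(VLLM_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```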
To learn more, refer to the following pages:
- GKE overview: Get started with Google Kubernetes Engine (GKE).
- AI/ML orchestration on GKE: Run optimized AI/ML workloads with GKE.
Dataflow ML
Dataflow ML is a Google Cloud platform for deploying and managing complete machine learning workflows. With Dataflow ML, you can prepare your data for model training with data processing tools, then use models like Gemma to perform local and remote inference with batch and streaming pipelines.
You can use Dataflow ML to seamlessly integrate Gemma into your Apache Beam inference pipelines with a few lines of code, enabling you to ingest data, verify and transform the data, feed text inputs into Gemma, and generate text output.
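The following sketch shows that pattern with a custom Apache Beam model handler. The `GemmaModelHandler` class here is illustrative, not a Beam built-in, and the KerasNLP preset name and generation parameters are assumptions; the official tutorial linked below covers a complete, supported pipeline.

```python
# Minimal sketch: an Apache Beam pipeline that runs local inference with
# Gemma through a custom ModelHandler. The handler is hypothetical and
# the preset name and generation settings are assumptions.
import apache_beam as beam
from apache_beam.ml.inference.base import ModelHandler, RunInference


class GemmaModelHandler(ModelHandler):
    """Loads a Gemma model with KerasNLP and generates text per prompt."""

    def load_model(self):
        import keras_nlp
        # Downloads the preset on first use; requires Gemma access.
        return keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

    def run_inference(self, batch, model, inference_args=None):
        for prompt in batch:
            yield model.generate(prompt, max_length=64)


with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "Prompts" >> beam.Create(["Why is the sky blue?"])
        | "Gemma" >> RunInference(GemmaModelHandler())
        | "Print" >> beam.Map(print)
    )
```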
To learn more, refer to the following pages:
- Use Gemma open models with Dataflow: Get started with Gemma in Dataflow.
- Run inference with a Gemma open model: Tutorial that uses Gemma in an Apache Beam inference pipeline.