Google Kubernetes Engine (GKE) is a managed Kubernetes service from Google Cloud that you can use to deploy and operate containerized applications at scale on Google's infrastructure. You can serve Gemma using Cloud Tensor Processing Units (TPUs) and graphics processing units (GPUs) on GKE with these LLM serving frameworks:
- Serve Gemma using GPUs on GKE with vLLM
- Serve Gemma using GPUs on GKE with TGI
- Serve Gemma using GPUs on GKE with Triton and TensorRT-LLM
- Serve Gemma using TPUs on GKE with JetStream
- Serve Gemma using TPUs on GKE with Saxml
By serving Gemma on GKE, you can implement a robust, production-ready inference serving solution with all the benefits of managed Kubernetes, including efficient scalability and high availability.
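As a rough illustration of what such a deployment looks like, the following is a minimal sketch of a Kubernetes Deployment that serves a Gemma model on a GKE GPU node pool using vLLM's OpenAI-compatible server. The image tag, model name, accelerator type, and Secret name (`hf-secret` holding a Hugging Face token) are assumptions for illustration; the official GKE tutorials linked above provide complete, tested manifests.

```yaml
# Sketch only: a single-replica vLLM server for Gemma on a GKE GPU node.
# Assumes an L4 GPU node pool and a Secret named "hf-secret" with a
# Hugging Face access token that has been granted access to Gemma.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest   # example image; pin a version in practice
        args:
        - --model=google/gemma-2b        # example model ID
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret            # hypothetical Secret name
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 8000
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # example accelerator
```

A Service (for example, of type ClusterIP) in front of this Deployment would then expose the server's OpenAI-compatible HTTP API on port 8000 to other workloads in the cluster.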
To learn more, refer to the following pages:
- GKE overview: Get started with Google Kubernetes Engine (GKE)
- AI/ML orchestration on GKE: Run optimized AI/ML workloads with GKE