Cost-efficient AI inference with Cloud TPU v5e on GKE

Nov 29, 2023 - cloud.google.com
Google Cloud has introduced the TPU v5e, a purpose-built AI accelerator designed for large-scale model training and inference. Paired with Google Kubernetes Engine (GKE), TPU v5e lets customers manage AI workloads efficiently at scale. In Google Cloud's MLPerf™ Inference 3.1 benchmark results, TPU v5e delivered 2.7x higher performance per dollar than TPU v4, and the same results were reproduced when running Cloud TPU v5e on GKE clusters, showing that GKE's scalability, orchestration, and operational benefits come without sacrificing TPU price-performance.
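
To make the TPU side of this concrete, here is a minimal, hedged JAX sketch that verifies TPU devices are visible to a workload (whether provisioned directly or through a GKE node pool) and times a compiled matrix multiply as a rough throughput probe. The shapes, dtype, and iteration count are illustrative assumptions, not the MLPerf benchmark configuration.

```python
# Minimal sketch: confirm TPU devices are visible to JAX and time a
# compiled workload. Shapes and iteration count are illustrative
# assumptions, not the MLPerf Inference 3.1 setup.
import time
import jax
import jax.numpy as jnp

devices = jax.devices("tpu")  # raises if no TPU backend is available
print(f"Found {len(devices)} TPU device(s): {devices}")

@jax.jit
def matmul(a, b):
    return a @ b

key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (4096, 4096), dtype=jnp.bfloat16)
b = jax.random.normal(key, (4096, 4096), dtype=jnp.bfloat16)

matmul(a, b).block_until_ready()  # warm-up: trigger XLA compilation

start = time.perf_counter()
for _ in range(10):
    out = matmul(a, b)
out.block_until_ready()
print(f"10 bf16 4096x4096 matmuls in {time.perf_counter() - start:.3f}s")
```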

GKE further reduces the total cost of ownership for inference on TPUs: it deploys and manages AI workloads, minimizes cost through autoscaling, automatically provisions the required compute, and provides high availability, minimal disruption, and full visibility into TPU applications. To demonstrate this, a proof of concept was built for TPU inference using the GPT-J 6B LLM with a single-host Saxml model server. This reference architecture shows how to achieve optimal price-performance for large-scale AI inference when operationalizing TPU v5e on GKE.
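
To make the GKE side concrete, the sketch below uses the official Kubernetes Python client to define a Pod that requests TPU v5e chips from a GKE node pool. The node-selector keys and values (cloud.google.com/gke-tpu-accelerator, cloud.google.com/gke-tpu-topology) and the google.com/tpu resource name follow GKE's documented TPU conventions, but the image name, namespace, and chip count are illustrative assumptions; a real Saxml deployment would add model-server configuration on top of this.

```python
# Hedged sketch: request TPU v5e chips for a single-host model-server Pod
# via the Kubernetes Python client. The container image, namespace, and
# chip count are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside GKE

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="saxml-model-server"),
    spec=client.V1PodSpec(
        # Selectors follow GKE's TPU node-pool labels; verify them
        # against your cluster's node pools.
        node_selector={
            "cloud.google.com/gke-tpu-accelerator": "tpu-v5-lite-podslice",
            "cloud.google.com/gke-tpu-topology": "2x4",
        },
        containers=[
            client.V1Container(
                name="model-server",
                image="us-docker.pkg.dev/example/sax-model-server:latest",  # placeholder
                resources=client.V1ResourceRequirements(
                    # google.com/tpu is the extended resource GKE exposes on
                    # TPU node pools; 8 chips matches a 2x4 v5e topology.
                    requests={"google.com/tpu": "8"},
                    limits={"google.com/tpu": "8"},
                ),
            )
        ],
        restart_policy="Always",
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```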

Key takeaways:

  • Google Cloud TPU v5e is a purpose-built AI accelerator that provides cost-efficiency and performance for large-scale model training and inference, especially when used with Google Kubernetes Engine (GKE).
  • Google Cloud's MLPerf™ Inference 3.1 benchmark results showed a 2.7x higher performance per dollar compared to TPU v4, demonstrating the efficiency of the TPU v5e.
  • GKE brings additional value by reducing the total cost of ownership for inference on TPUs, offering features like autoscaling, automatic provisioning of compute resources, and built-in health monitoring.
  • A reference architecture demonstrates TPU inference using the GPT-J 6B LLM served by a single-host Saxml model server, showing how to achieve optimal price-performance for large-scale AI inference when operationalizing TPU v5e on GKE (see the client-side sketch after this list).
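
For a sense of what querying such a deployment might look like, below is a hedged sketch based on the Saxml project's published Python client. The model path (/sax/test/gptj) is a placeholder, and the exact client interface and return types should be verified against the Saxml repository (github.com/google/saxml) rather than taken as confirmed.

```python
# Hedged sketch: query a GPT-J model published to a Saxml cell. The model
# path is a placeholder and the method names are assumptions; check the
# Saxml client docs (github.com/google/saxml) for the exact interface.
import sax  # Saxml Python client

# "/sax/test/gptj" is a placeholder SAX path for a published GPT-J model.
model = sax.Model("/sax/test/gptj")
lm = model.LM()

# Generate() returns candidate completions (typically text/score pairs).
for text, score in lm.Generate("The capital of France is"):
    print(f"{score:.3f}  {text}")
```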