Maintaining large-scale AI capacity at Meta

Jun 16, 2024 - engineering.fb.com
Meta is transforming its data centers to support the growing demands of AI training, particularly large generative AI models that require vast resources. The company has built one of the world's largest AI training infrastructures and plans to scale to 600,000 GPUs in the next year. This infrastructure comprises dozens of AI clusters of varying sizes, running thousands of training jobs daily. Challenges in the transformation include reconfiguring the fleet without disrupting growth and ensuring compatibility between software and hardware components.

To maintain these training clusters, Meta uses a technique called maintenance trains: small amounts of capacity are cyclically taken out of service for upgrades, one maintenance domain at a time. This keeps all capacity except the domain under maintenance available and allows new components to be rolled out gradually. A work orchestrator called OpsPlanner handles overlapping operations and ensures upgrades are applied before hosts return to production, backed by a deep stack of safety features for handling any issues that arise. Meta is committed to rapid innovation and aims to keep leading in the generative AI space.
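
The maintenance-train idea can be illustrated with a minimal Python sketch. It is not Meta's implementation: the names `MaintenanceDomain`, `ready_for_production`, and `run_train`, the domain names, and the upgrade labels are all hypothetical, chosen only to show a round-robin visit in which one domain at a time is offline and a domain returns to production only once every pending upgrade has been applied.

```python
from dataclasses import dataclass, field
from itertools import cycle, islice


@dataclass
class MaintenanceDomain:
    """Hypothetical unit of capacity that is taken offline together."""
    name: str
    applied: set = field(default_factory=set)
    in_production: bool = True


def ready_for_production(domain, required):
    """Gate: every required upgrade must be applied before returning."""
    return set(required) <= domain.applied


def run_train(domains, required, visits):
    """Visit `visits` domains in round-robin order, one at a time."""
    visited = []
    for domain in islice(cycle(domains), visits):
        domain.in_production = False
        # Invariant: all capacity minus one maintenance domain stays up.
        offline = [d.name for d in domains if not d.in_production]
        assert offline == [domain.name]
        # Apply whatever upgrades this domain is still missing.
        domain.applied |= set(required)
        if ready_for_production(domain, required):
            domain.in_production = True
        visited.append(domain.name)
    return visited


# Illustrative data only; names and versions are made up.
domains = [MaintenanceDomain("md-a"), MaintenanceDomain("md-b"),
           MaintenanceDomain("md-c")]
required = {"firmware-v2", "driver-v5"}
visited = run_train(domains, required, visits=4)
print(visited)
```

Because the train cycles, a fourth visit wraps around to the first domain again, so an upgrade introduced mid-cycle would be picked up on the next pass.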

Key takeaways:

  • Meta has built one of the world’s largest AI training infrastructures, with plans to scale to 600,000 GPUs in the next year, to support the growing computational needs of AI workloads.
  • Meta uses a technique called maintenance trains to maintain its fleet of clusters, ensuring all capacity minus one maintenance domain is up and running 24/7.
  • Meta has developed a work orchestrator called OpsPlanner, which handles a million operations per day, to ensure hosts have the correct upgrades applied before entering production.
  • Meta is dedicated to rapid innovation and building foundational infrastructure to lead in the generative AI space, with a focus on creating technologies that have a positive societal impact.