To maintain these training clusters, Meta uses a technique called maintenance trains: small slices of capacity are cyclically taken offline for upgrades, which keeps the rest of the fleet available and allows new components to be rolled out gradually. A work orchestrator called OpsPlanner serializes overlapping operations and ensures upgrades are applied before hosts return to production, backed by a deep stack of safety features for handling failures. Meta is committed to rapid innovation and aims to continue leading in the generative AI space.
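The maintenance-train pattern can be sketched in a few lines. This is a hypothetical illustration (the function and domain names are invented, not Meta's actual tooling): the fleet is partitioned into maintenance domains, and the train visits one domain at a time, so all capacity minus one domain stays in production throughout the rollout.

```python
# Hypothetical sketch of a maintenance train: drain one maintenance
# domain, upgrade its hosts, return it to production, then move on.
# Names here are illustrative, not Meta's real API.

def run_maintenance_train(domains, upgrade):
    """Upgrade every host, one maintenance domain at a time.

    `domains` maps a domain name to its list of hosts; `upgrade` is the
    per-host upgrade routine. At most one domain is ever out of
    production, so capacity availability is guaranteed by construction.
    """
    peak_down = 0
    for name, hosts in domains.items():
        # Drain: take this domain (and only this domain) offline.
        down = list(hosts)
        peak_down = max(peak_down, len(down))
        for host in down:
            upgrade(host)          # e.g. firmware, kernel, GPU driver
        # Restore: hosts rejoin production before the next domain drains.
        down.clear()
    return peak_down               # peak number of hosts offline at once


upgraded = []
fleet = {f"domain-{i}": [f"host-{i}-{j}" for j in range(4)]
         for i in range(8)}
peak = run_maintenance_train(fleet, upgraded.append)
# All 32 hosts are upgraded, yet only one 4-host domain is down at a time.
```

Because only one domain is drained at a time, a new component can also be rolled out gradually: each pass of the train exposes one more slice of the fleet to the change.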
Key takeaways:
- Meta has built one of the world’s largest AI training infrastructures, with plans to scale to 600,000 GPUs in the next year, to support the growing computational needs of AI workloads.
- Meta uses a technique called maintenance trains to maintain its fleet of clusters, ensuring all capacity minus one maintenance domain is up and running 24/7.
- Meta has developed a work orchestrator called OpsPlanner, which handles a million operations per day, to ensure hosts have the correct upgrades applied before entering production.
- Meta is dedicated to rapid innovation and building foundational infrastructure to lead in the generative AI space, with a focus on creating technologies that have a positive societal impact.
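The OpsPlanner takeaway above amounts to an invariant that can be shown in a short sketch. This is a hypothetical illustration of that invariant only (OpsPlanner's actual design is not detailed here): overlapping operations on a host are queued and serialized, and the host is released to production only once every required upgrade has been applied.

```python
# Hypothetical sketch of an OpsPlanner-style work orchestrator.
# Class and upgrade names are invented for illustration.
from collections import deque


class Orchestrator:
    def __init__(self, required_upgrades):
        self.required = set(required_upgrades)
        self.applied = {}   # host -> set of upgrades already applied
        self.queue = {}     # host -> deque of pending operations

    def submit(self, host, upgrade):
        # Overlapping operations on one host are serialized via a queue.
        self.queue.setdefault(host, deque()).append(upgrade)

    def run(self, host):
        # Drain the host's queue, applying one operation at a time.
        pending = self.queue.get(host, deque())
        done = self.applied.setdefault(host, set())
        while pending:
            done.add(pending.popleft())

    def ready_for_production(self, host):
        # The gate: a host rejoins production only with all upgrades applied.
        return self.required <= self.applied.get(host, set())


orch = Orchestrator(required_upgrades=["kernel-6.4", "gpu-driver-535"])
orch.submit("host-a", "kernel-6.4")
orch.run("host-a")
print(orch.ready_for_production("host-a"))   # False: driver still pending
orch.submit("host-a", "gpu-driver-535")
orch.run("host-a")
print(orch.ready_for_production("host-a"))   # True: host may re-enter production
```

The key design choice is that the production gate checks applied state rather than submitted work, so a host can never slip back into the fleet with an upgrade still pending.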