The network infrastructure has been designed to meet the increasing demands of computational density and scale. By segregating the frontend and backend networks, employing tailored routing schemes, and optimizing collective traffic patterns, Meta has built a performant and reliable network infrastructure. The company also co-designed its collective communication library and the RoCE transport to enforce receiver-driven traffic admission for better performance. The insights gained underline the importance of understanding the training workload and translating its implications into the design of individual network components.
Key takeaways:
- Meta has built one of the world's largest AI networks, which plays a crucial role in interconnecting tens of thousands of GPUs and enables the training of models with hundreds of billions of parameters, such as Llama 3.1 405B.
- The company has designed a dedicated backend network specifically for distributed training, allowing it to operate and scale independently from the rest of the data center network.
- Meta has employed various routing schemes and optimized collective traffic patterns to build a performant and reliable network infrastructure. This includes a receiver-driven traffic admission scheme that limits the amount of in-flight traffic on the network, especially as congestion starts to build up.
- Despite the challenges and complexities, the design and operation of large-scale RoCE networks for distributed AI training workloads have evolved to meet the increasing demands of computational density and scale, advancing distributed AI training infrastructure as a whole.
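The receiver-driven traffic admission mentioned above can be illustrated with a minimal credit-based sketch. This is a hypothetical simplification, not Meta's actual implementation: the `Receiver`, `Sender`, and `window_bytes` names are invented for illustration. The core idea it shows is that the receiver, not the senders, decides when data may enter the network, capping the total in-flight bytes it allows.

```python
from collections import deque

class Receiver:
    """Grants transmission credits so in-flight bytes never exceed a window.
    Hypothetical sketch of receiver-driven admission, not a real RoCE API."""

    def __init__(self, window_bytes):
        self.window = window_bytes   # max in-flight bytes the receiver admits
        self.in_flight = 0
        self.pending = deque()       # senders waiting for a grant

    def request(self, sender, size):
        # A sender announces a message; it may not transmit until granted.
        self.pending.append((sender, size))
        self._grant()

    def _grant(self):
        # Issue grants (clear-to-send) while the window has room.
        while self.pending and self.in_flight + self.pending[0][1] <= self.window:
            sender, size = self.pending.popleft()
            self.in_flight += size
            sender.send(size)

    def delivered(self, size):
        # Data arrived: free window space and admit waiting senders.
        self.in_flight -= size
        self._grant()

class Sender:
    def __init__(self, name):
        self.name = name
        self.sent = []

    def send(self, size):
        # Only called once the receiver has granted admission.
        self.sent.append(size)

# Three senders each want to push 4 KiB, but the receiver only
# admits 8 KiB of in-flight traffic at a time.
rx = Receiver(window_bytes=8192)
senders = [Sender(f"gpu{i}") for i in range(3)]
for s in senders:
    rx.request(s, 4096)
assert sum(len(s.sent) for s in senders) == 2  # third sender is held back
rx.delivered(4096)                             # one message completes
assert sum(len(s.sent) for s in senders) == 3  # now the third is admitted
```

Because admission is paced by the receiver's window rather than by each sender's own congestion estimate, the queue buildup moves from switch buffers (where it causes congestion) to the senders' pending lists (where it is harmless), which is the intuition behind limiting in-flight traffic as congestion forms.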