
RoCE networks for distributed AI training at scale

Aug 07, 2024 - news.bensbites.com
Meta has developed one of the world's largest AI networks to support large-scale distributed AI training workloads, as detailed in its paper presented at ACM SIGCOMM 2024. The network uses RDMA over Converged Ethernet version 2 (RoCEv2) as the inter-node communication transport and has evolved from prototypes to numerous deployed clusters, each accommodating thousands of GPUs. These clusters support a range of production distributed GPU training jobs, including ranking, content recommendation, content understanding, natural language processing, and GenAI model training.
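To make the traffic these clusters carry concrete, here is a minimal Python sketch of ring all-reduce, a representative collective pattern in distributed training. It is an illustration only: the function name and data layout are assumptions, not the production collective library that runs over RoCEv2.

```python
# Minimal simulation of ring all-reduce, a representative collective
# traffic pattern on training clusters. Pure-Python sketch; names and
# data layout are illustrative, not the production collective library.

def ring_all_reduce(data):
    """Sum-reduce across ranks with the ring algorithm.

    data[r][c] is rank r's contribution to chunk c; each rank's vector
    is split into exactly len(data) chunks, one per rank.
    """
    n = len(data)

    # Phase 1: reduce-scatter. At step s, rank r sends its running sum
    # of chunk (r - s) % n one hop clockwise; the receiver accumulates.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r - step) % n, data[r][(r - step) % n])
                 for r in range(n)]          # snapshot before applying
        for dst, c, val in sends:
            data[dst][c] += val
    # Rank r now owns the fully reduced chunk (r + 1) % n.

    # Phase 2: all-gather. Reduced chunks circle the ring, overwriting
    # the stale partial sums on each rank.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - step) % n, data[r][(r + 1 - step) % n])
                 for r in range(n)]
        for dst, c, val in sends:
            data[dst][c] = val
    return data

# Four ranks, each contributing [r+1, r+1, r+1, r+1]; every chunk sums to 10.
print(ring_all_reduce([[r + 1] * 4 for r in range(4)]))
```

Note the shape of the traffic: every rank sends a fixed-size chunk to exactly one neighbor at every step, which is why collective workloads stress a network very differently from ordinary data-center traffic.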

The network infrastructure has been designed to meet the increasing demands of computational density and scale. By segregating frontend and backend networks, employing various routing schemes, and optimizing collective traffic patterns, Meta has been able to build a performant and reliable network infrastructure. The company has also co-designed the collective library and RoCE transport to enforce receiver-driven traffic admission for better performance. The insights gained underline the importance of understanding the training workload and translating its implications into network component design.
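The receiver-driven idea can be sketched in a few lines: the receiver grants credits for a bounded window of in-flight bytes, and a sender transmits only against granted credit. The sketch below is a hypothetical Python illustration; the class name, method names, and threading model are assumptions made for clarity, not Meta's RoCE transport implementation.

```python
# Toy sketch of receiver-driven traffic admission: the receiver grants
# credits for a bounded window of in-flight bytes, and senders transmit
# only against granted credit. Hypothetical names and threading model;
# not Meta's RoCE transport implementation.
import threading

class ReceiverAdmission:
    def __init__(self, max_in_flight_bytes: int):
        self.credits = max_in_flight_bytes   # bytes the receiver will accept
        self.cv = threading.Condition()

    def request_to_send(self, nbytes: int) -> None:
        """Sender side: block until the receiver grants credit."""
        with self.cv:
            while self.credits < nbytes:     # back-pressure: too much in flight
                self.cv.wait()
            self.credits -= nbytes           # claim a slice of the window

    def delivered(self, nbytes: int) -> None:
        """Receiver side: data has landed, return the credit to the pool."""
        with self.cv:
            self.credits += nbytes
            self.cv.notify_all()             # wake senders waiting for credit
```

Capping in-flight bytes at the receiver keeps queues shallow when many senders converge on one host, which is exactly the moment congestion would otherwise start to build.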

Key takeaways:

  • Meta has built one of the world's largest AI networks, interconnecting tens of thousands of GPUs and enabling the training of models with hundreds of billions of parameters, such as Llama 3.1 405B.
  • The company has designed a dedicated backend network specifically for distributed training, allowing it to operate and scale independently from the rest of the data center network.
  • Meta has employed various routing schemes and optimized collective traffic patterns to build a performant and reliable network infrastructure. This includes receiver-driven traffic admission, which limits the amount of in-flight traffic on the network, especially as congestion starts to build (a toy illustration of ECMP-style routing follows this list).
  • Despite the challenges and complexities, the design and operation of large-scale RoCE networks for distributed AI training have evolved to meet the increasing demands of computational density and scale, advancing distributed AI training infrastructure as a whole.
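As a deliberately simplified example of the routing schemes mentioned above, the Python toy below hashes a flow's five-tuple onto one of several equal-cost uplinks, the classic ECMP approach. All names and the hash choice are illustrative (hardware switches use simpler hashes), and this is not Meta's routing implementation; it merely hints at why AI traffic is hard to route, since a few large collective flows can collide on the same uplink.

```python
# Toy ECMP-style path selection: hash each flow's five-tuple onto one of
# the equal-cost uplinks. Field names and the SHA-256 hash are illustrative
# (hardware switches use simpler hashes); this is not Meta's routing code.
import hashlib

def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto, num_uplinks):
    """Pick an uplink deterministically from a flow's five-tuple."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % num_uplinks

# A few bulky collective flows between the same hosts differ only in one
# field; with so little entropy they can easily collide on one uplink.
for sport in (49152, 49153, 49154):
    print(ecmp_uplink("10.0.0.1", "10.0.1.1", sport, 4791, "UDP", 8))
```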