98% GPU Utilization Achieved in 1k GPU-Scale AI Training Using Distributed Cache

Mar 11, 2024 - juicefs.com
In September 2023, MLPerf introduced its Storage Benchmark for large-scale performance testing of storage systems in AI model training scenarios. The benchmark currently supports BERT and UNet3D model training but not large language models such as GPT and LLaMA; insights can still be drawn from the BERT results. High-performance storage vendors and research institutions, including DataDirect Networks, Nutanix, Weka, and Argonne National Laboratory, have released MLPerf test results. JuiceFS Enterprise Edition, a high-performance distributed file system, was also tested and maintained GPU utilization of over 97% for UNet3D and over 98% for BERT.

JuiceFS Enterprise Edition is a parallel file system built on object storage. For testing, it was deployed in the cloud with object storage as the persistent data layer, a three-node metadata cluster, and a multi-node distributed cache cluster. In BERT testing, JuiceFS maintained over 98% GPU utilization in 1,000-GPU-scale training. In UNet3D testing, it maintained over 97% GPU utilization at a scale approaching 500 GPUs. The distributed cache's key advantage is its scalability: adding cache nodes raises the storage system's aggregate read bandwidth.
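The GPU utilization figures above are essentially the fraction of wall-clock time the accelerators spend computing rather than stalling on storage I/O. A minimal Python sketch of that relationship (the timings are illustrative assumptions, not numbers from the benchmark):

```python
def gpu_utilization(compute_time_s: float, io_stall_s: float) -> float:
    """Fraction of wall-clock time GPUs spend computing rather than waiting on storage."""
    return compute_time_s / (compute_time_s + io_stall_s)

# Hypothetical: if each training step computes for 100 ms, storage may add
# at most ~2 ms of stall per step to keep utilization at or above 98%.
print(round(gpu_utilization(0.100, 0.002), 3))  # 0.98
```

This is why storage benchmarks for AI training report GPU utilization rather than raw bandwidth: the metric directly reflects how much expensive accelerator time is wasted waiting for data.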

Key takeaways:

  • In September 2023, MLPerf introduced its Storage Benchmark, a benchmark for large-scale performance testing of storage systems in AI model training scenarios.
  • High-performance storage vendors and research institutions, including DataDirect Networks, Nutanix, Weka, and Argonne National Laboratory, released MLPerf test results as industry references.
  • JuiceFS Enterprise Edition, a high-performance distributed file system, maintained GPU utilization of over 97% for UNet3D at a 500-GPU scale and over 98% for BERT at a 1,000-GPU scale.
  • JuiceFS uses a distributed cache to greatly improve the system's I/O throughput and inexpensive object storage for data persistence, making it well suited to large-scale AI workloads.
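The scalability claim in the takeaways can be sketched as simple bandwidth arithmetic: a single object-storage endpoint offers a roughly fixed read bandwidth, while a distributed cache's aggregate read bandwidth grows with the number of cache nodes. The figures below are hypothetical, for illustration only:

```python
def cache_read_bandwidth_gbps(nodes: int, per_node_gbps: float) -> float:
    """Aggregate read bandwidth of a distributed cache grows roughly linearly
    with node count, since clients fan reads out across all cache nodes."""
    return nodes * per_node_gbps

# Hypothetical: 10 cache nodes with 5 GB/s of network bandwidth each yield
# ~50 GB/s aggregate, versus a fixed per-endpoint limit typical of object storage.
print(cache_read_bandwidth_gbps(10, 5.0))  # 50.0
```

This linear scaling is what lets the cache tier keep hundreds of GPUs fed while the cheaper object-storage tier only absorbs cache misses.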
