The article also points to examples of LLM training with the YaFSDP framework in the repository's 'examples' folder; running them requires a Docker image that can be built with a provided script. The article concludes by inviting users to report bugs or ask questions via GitHub, and by requesting a citation if the codebase is used.
Key takeaways:
- YaFSDP is a Sharded Data Parallelism framework designed for transformer-like neural network architectures, offering up to 20% faster LLM pre-training and better performance under high memory pressure.
- YaFSDP reduces communication and memory-operation overhead, as demonstrated in benchmarks comparing it with FSDP across various pre-training setups (a minimal FSDP baseline is sketched after this list).
- Examples of LLM training using YaFSDP can be found in the project's 'examples' folder; they require a Docker image based on the NVIDIA PyTorch image with some patched libraries.
- If any bugs are encountered or questions arise, users are encouraged to open a GitHub issue; if the codebase is used, citing it via the provided BibTeX entry is requested.
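
To make the comparison concrete, here is a minimal sketch of the PyTorch FSDP baseline that YaFSDP is benchmarked against. It uses only the standard torch.distributed.fsdp API; the model, sizes, and training step are illustrative placeholders rather than the project's actual example code, and in the project's examples YaFSDP plays the role that FSDP plays here.

```python
# Minimal sketch of the FSDP baseline (standard PyTorch API).
# The model and hyperparameters below are illustrative, not from the YaFSDP examples.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # One process per GPU, as set up by a launcher such as torchrun.
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # A small transformer-like model; real setups would build an LLM here.
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
        num_layers=8,
    ).cuda()

    # Shard parameters, gradients, and optimizer state across ranks.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # One illustrative training step on random data.
    batch = torch.randn(4, 128, 1024, device="cuda")
    loss = model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this would typically be launched with torchrun so that each GPU gets its own process; YaFSDP's claimed gains come from reducing the communication and memory-operation overhead that this wrapping pattern incurs at scale.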