
Introducing gigaGPT: GPT-3 sized models in 565 lines of code - Cerebras

Dec 11, 2023 - cerebras.net
Cerebras has developed gigaGPT, an implementation of Andrei Karpathy’s nanoGPT that can train models with over 100 billion parameters. The system utilizes the large memory and compute capacity of Cerebras hardware to enable large-scale training on vanilla torch.nn code. gigaGPT is designed to be compact and efficient, with the entire repository consisting of just 565 lines of code. It supports long context lengths and works with a variety of optimizers without the need for additional code or third-party frameworks.
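For context, a nanoGPT-style model definition of the kind the article describes can be written in plain torch.nn along these lines. This is only a minimal sketch, not gigaGPT's actual source; the class names and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of a GPT-style model in plain torch.nn (illustrative only).
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Pre-norm attention with a residual connection
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        # Pre-norm feed-forward with a residual connection
        return x + self.mlp(self.ln2(x))

class TinyGPT(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, n_heads: int,
                 n_layers: int, max_len: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(
            [Block(d_model, n_heads) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask: each position attends only to earlier positions
        mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=idx.device), diagonal=1
        )
        for block in self.blocks:
            x = block(x, attn_mask=mask)
        return self.head(self.ln_f(x))
```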

The gigaGPT model fits entirely into the system memory of Cerebras hardware, eliminating the need for complex sharding or pipelining techniques. It runs on Cerebras Wafer Scale Clusters, which consist of 1 to 192 Cerebras CS-2 systems supported by CPU server nodes that store parameters and data, connected by an interconnect. The model weights are streamed to the wafer one layer at a time during training, allowing models to scale from millions to hundreds of billions of parameters without specialized parallelization techniques. The gigaGPT code is simple and easy to modify and customize, making it a significant step toward more accessible, scalable, and efficient AI model training.
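To illustrate what training on "vanilla torch.nn code" means in practice, here is a minimal sketch of a standard PyTorch training loop with no sharding or pipelining wrappers. The model, data loader, and hyperparameters are hypothetical, and the Cerebras-specific backend setup that handles weight streaming is deliberately not shown here.

```python
# Minimal sketch of a vanilla PyTorch training loop (illustrative only).
# No model-parallel or pipeline wrappers are involved; on Cerebras hardware,
# weight streaming is handled outside this user-facing loop.
import torch
import torch.nn.functional as F

def train(model, data_loader, steps: int, lr: float = 3e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step, (inputs, targets) in zip(range(steps), data_loader):
        logits = model(inputs)  # (batch, seq, vocab)
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1)
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 100 == 0:
            print(f"step {step}: loss {loss.item():.4f}")
```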

Key takeaways:

  • gigaGPT is Cerebras’ implementation of Andrei Karpathy’s nanoGPT, capable of training models of well over 100B parameters without introducing additional code or relying on third-party frameworks.
  • Unlike other large transformer models, gigaGPT does not require complex frameworks for training and can be implemented with a compact and hackable codebase.
  • The gigaGPT model fits entirely into the system memory of Cerebras hardware, eliminating the need for sharding or pipelining techniques. It uses the Cerebras PyTorch package to simplify the distributed computing needs of the problem.
  • gigaGPT has been validated by training four models with 111M, 13B, 70B, and 175B parameters, and Cerebras believes it can scale to models in excess of 1 trillion parameters.