Feature Story
Let's reproduce GPT-2 (1.6B): one 8XH100 node, 24 hours, $672, in llm.c · karpathy/llm.c · Discussion #677
Jul 12, 2024 · news.bensbites.com

The author compares the llm.c implementation against an equivalent training run in PyTorch, noting that llm.c uses less memory and runs faster. The final model is shared, along with instructions for exporting it for use with the HuggingFace transformers library. The author closes by mentioning an attempt at a 400B-token run, but does not provide further details.
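Since the post mentions exporting the trained model for HuggingFace transformers, here is a minimal loading sketch, assuming the export yields a standard GPT-2 checkpoint directory; the directory name below is a placeholder, not the actual path from the post.

```python
# Minimal sketch: load an llm.c-exported GPT-2 checkpoint with HuggingFace transformers.
# Assumes the export produced a standard GPT-2 model directory; "./gpt2-1558M-export"
# is a placeholder, not the actual export name used in the post.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # stock GPT-2 BPE tokenizer
model = GPT2LMHeadModel.from_pretrained("./gpt2-1558M-export")

prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_k=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```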
Key takeaways
- The post covers reproducing GPT-2 in llm.c: the full 1558M-parameter model introduced in OpenAI's original GPT-2 blog post.
- llm.c does this directly in C/CUDA, without the typical training stack involving the Python interpreter and a significantly more complex deep learning library.
- Thanks to improvements in compute, software, and data, the model can now be reproduced on a single 8XH100 node in about 24 hours for $672 (see the cost check after this list).
- The post also provides a detailed guide to training GPT-2 with llm.c, including a breakdown of the arguments passed to the training run and advice on managing memory constraints (a sketch of the launch command follows this list).
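The quoted $672 is consistent with a flat hourly GPU rental; a quick sanity check, noting that the per-GPU-hour rate is inferred here rather than stated in the summary:

```python
# Back-of-the-envelope check of the quoted cost: 8 H100s for 24 hours at $672 total
# implies roughly $3.50 per GPU-hour (the hourly rate is inferred, not stated above).
gpus, hours, total_usd = 8, 24, 672
print(f"Implied rate: ${total_usd / (gpus * hours):.2f} per H100-hour")  # -> $3.50
```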
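For a sense of what the training guide covers, here is a sketch of a multi-GPU llm.c launch. The flag meanings follow the llm.c README conventions, but the values and data paths shown are illustrative; the exact settings for the 1558M run are in the linked discussion.

```bash
# Sketch of a multi-GPU llm.c training launch (flags per the llm.c README; the
# values shown are illustrative, not the exact 1558M settings from the discussion).
#   -i/-j  training / validation data shards      -o  output/log directory
#   -e     model config, e.g. "d48" = 48-layer GPT-2 (1558M), initialized from scratch
#   -b/-t  per-device micro-batch size / sequence length
#   -d     total tokens per optimizer step (gradient accumulation across GPUs)
#   -r     recompute activations to save memory   -z  ZeRO-1 optimizer-state sharding
#   -l/-u  learning rate / warmup iterations      -h  run HellaSwag eval
mpirun -np 8 ./train_gpt2cu \
    -i "dev/data/fineweb100B/fineweb_train_*.bin" \
    -j "dev/data/fineweb100B/fineweb_val_*.bin" \
    -o log_gpt2_1558M \
    -e "d48" \
    -b 16 -t 1024 \
    -d 1048576 \
    -r 0 -z 1 \
    -l 0.0006 -u 700 \
    -h 1
```

Lowering -b (with -d held fixed, so gradient accumulation picks up the slack) and enabling -r are the kinds of memory levers the post's guide discusses for fitting the 1.6B model on a given GPU.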