Feature Story
Let's reproduce GPT-2 (1.6B): one 8XH100 node, 24 hours, $672, in llm.c · karpathy/llm.c · Discussion #677
Jul 12, 2024 · news.bensbites.com

The author compares the llm.c implementation against an equivalent training run in PyTorch, noting that llm.c uses less memory and runs faster. The final model is shared, along with instructions for exporting it for use with the HuggingFace transformers library. The author closes by mentioning an attempt at a 400B-token run, but does not provide further details.
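Since the post mentions exporting the trained model for HuggingFace transformers, here is a minimal loading sketch, assuming the export yields a standard GPT-2 checkpoint directory; the directory name below is a placeholder, not the actual path from the post.

```python
# Minimal sketch: load an llm.c-exported GPT-2 checkpoint with HuggingFace transformers.
# Assumes the export produced a standard GPT-2 model directory; "./gpt2-1558M-export"
# is a placeholder, not the actual export name used in the post.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # stock GPT-2 BPE tokenizer
model = GPT2LMHeadModel.from_pretrained("./gpt2-1558M-export")

prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_k=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```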
Key takeaways
- The post covers reproducing GPT-2 in llm.c: the full 1558M-parameter model introduced in OpenAI's original GPT-2 blog post.
- llm.c does this directly in C/CUDA, without the typical training stack involving the Python interpreter and a significantly more complex deep learning library.
- Thanks to improvements in compute, software, and data, the model can now be reproduced on a single 8XH100 node in about 24 hours for $672 (see the cost check after this list).
- The post also provides a detailed guide to training GPT-2 with llm.c, including a breakdown of the arguments passed to the training run and advice on managing memory constraints (a sketch of the launch command follows this list).
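The quoted $672 is consistent with a flat hourly GPU rental; a quick sanity check, noting that the per-GPU-hour rate is inferred here rather than stated in the summary:

```python
# Back-of-the-envelope check of the quoted cost: 8 H100s for 24 hours at $672 total
# implies roughly $3.50 per GPU-hour (the hourly rate is inferred, not stated above).
gpus, hours, total_usd = 8, 24, 672
print(f"Implied rate: ${total_usd / (gpus * hours):.2f} per H100-hour")  # -> $3.50
```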
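For a sense of what the training guide covers, here is a sketch of a multi-GPU llm.c launch. The flag meanings follow the llm.c README conventions, but the values and data paths shown are illustrative; the exact settings for the 1558M run are in the linked discussion.

```bash
# Sketch of a multi-GPU llm.c training launch (flags per the llm.c README; the
# values shown are illustrative, not the exact 1558M settings from the discussion).
#   -i/-j  training / validation data shards      -o  output/log directory
#   -e     model config, e.g. "d48" = 48-layer GPT-2 (1558M), initialized from scratch
#   -b/-t  per-device micro-batch size / sequence length
#   -d     total tokens per optimizer step (gradient accumulation across GPUs)
#   -r     recompute activations to save memory   -z  ZeRO-1 optimizer-state sharding
#   -l/-u  learning rate / warmup iterations      -h  run HellaSwag eval
mpirun -np 8 ./train_gpt2cu \
    -i "dev/data/fineweb100B/fineweb_train_*.bin" \
    -j "dev/data/fineweb100B/fineweb_val_*.bin" \
    -o log_gpt2_1558M \
    -e "d48" \
    -b 16 -t 1024 \
    -d 1048576 \
    -r 0 -z 1 \
    -l 0.0006 -u 700 \
    -h 1
```

Lowering -b (with -d held fixed, so gradient accumulation picks up the slack) and enabling -r are the kinds of memory levers the post's guide discusses for fitting the 1.6B model on a given GPU.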