The author also reports reproducing the 350M-parameter model and provides its exact launch command: training ran for 30B tokens, taking about 14 hours and costing around $200. Next steps include reproducing the 740M and 1558M models and improving the current code with multi-node training support, cleanup, and better testing. The post closes with an FAQ covering sampling, chatting with the model, multi-node distributed training, determinism, training in fp8, and compatibility with non-NVIDIA GPUs and CPUs.
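On the sampling FAQ item: generating text comes down to turning the model's output logits into a token draw. Below is a minimal, self-contained C sketch of temperature sampling under that framing; the function name, toy logits, and 4-token vocabulary are hypothetical, and this is not the llm.c API, just an illustration of the step involved.

```c
// sample_sketch.c -- minimal temperature sampling from logits.
// Hypothetical illustration only; NOT the llm.c interface.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// Softmax over logits (with temperature), then draw one token index
// using a uniform random "coin" in [0, 1).
int sample_token(const float *logits, int vocab_size, float temperature,
                 float coin) {
    // find the max logit for numerical stability
    float max_logit = logits[0];
    for (int i = 1; i < vocab_size; i++)
        if (logits[i] > max_logit) max_logit = logits[i];

    // softmax denominator
    double sum = 0.0;
    for (int i = 0; i < vocab_size; i++)
        sum += exp((logits[i] - max_logit) / temperature);

    // walk the CDF until it passes the coin
    double cdf = 0.0;
    for (int i = 0; i < vocab_size; i++) {
        cdf += exp((logits[i] - max_logit) / temperature) / sum;
        if (coin < cdf) return i;
    }
    return vocab_size - 1;  // guard against floating-point rounding
}

int main(void) {
    srand(42);  // fixed seed so the demo is reproducible
    float logits[4] = {2.0f, 1.0f, 0.5f, 0.1f};  // toy 4-token vocab
    float coin = (float)rand() / ((float)RAND_MAX + 1.0f);
    printf("sampled token: %d\n", sample_token(logits, 4, 1.0f, coin));
    return 0;
}
```

Lower temperatures concentrate probability on the highest logits (greedier output); higher temperatures flatten the distribution toward more varied samples.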
Key takeaways:
- The GPT-2 (124M) model can be reproduced in 90 minutes for $20 using llm.c, a C/CUDA implementation of roughly 4,000 lines.
- Training also works on a single GPU; it just takes longer (4-24 hours depending on the GPU).
- llm.c reaches up to ~60% model FLOPs utilization (MFU), with room for further optimization; see the rough estimate after this list.
- The author also reproduced the 350M-parameter model, which took about 14 hours and cost around $200.
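The utilization figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates MFU for the 350M run from the numbers above, using the standard ~6N FLOPs-per-token rule for training and assuming an 8x A100 node (the hardware and the bf16 peak figure are assumptions, not stated in this summary). Since 6N ignores attention FLOPs, the result is a lower bound, roughly consistent with the ~60% claim.

```c
// mfu_estimate.c -- back-of-the-envelope MFU check for the 350M run.
// Figures from the post: 30B tokens in ~14 hours. Hardware is ASSUMED:
// one 8x A100 node at ~312 TFLOP/s bf16 peak per GPU.
#include <stdio.h>

int main(void) {
    double params     = 350e6;   // model parameters (350M)
    double tokens     = 30e9;    // tokens trained on
    double hours      = 14.0;    // wall-clock training time
    double num_gpus   = 8.0;     // assumed: one 8x A100 node
    double peak_flops = 312e12;  // assumed: A100 bf16 peak per GPU

    double tok_per_sec = tokens / (hours * 3600.0);
    // ~6*N FLOPs per token for forward+backward (ignores attention,
    // so this underestimates the true achieved FLOPs)
    double achieved = 6.0 * params * tok_per_sec;
    double mfu = achieved / (num_gpus * peak_flops);

    printf("throughput: %.0f tokens/s\n", tok_per_sec);
    printf("achieved:   %.2e FLOP/s\n", achieved);
    printf("MFU (>=):   %.0f%%\n", mfu * 100.0);
    return 0;
}
```

Under these assumptions the run works out to roughly 600K tokens/s and an MFU lower bound around 50%; counting attention FLOPs would push the estimate closer to the reported ~60%.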