The author concludes that despite the difficulties, including unreliable compute providers and scarce compute, they were able to train strong models with a small number of trials and limited resources. They attribute this success to the strong prior knowledge and intuition built up over their ML careers. The author acknowledges that this is only part of the story of starting a company and promises to share more about data pipelines, human evaluation, and other topics in the future.
Key takeaways:
- The author discusses the challenges faced while building infrastructure and training large language & multimodal models from scratch at Reka, highlighting the instability of compute providers and the variance in the quality of clusters and accelerators.
- They express frustration with the unpredictable quality of hardware from different providers, calling it a "hardware lottery" because the difficulty and pain of training varies widely depending on which cluster and accelerators one ends up with.
- The author also discusses the difficulties of multi-cluster setups, the lower quality of external codebases compared to what they were used to at Google, and the need to abandon systematic model scaling in favor of more instinctive "YOLO" runs given a startup's limited resources.
- Despite these challenges, the author emphasizes that they trained strong models with a small number of trials and limited resources, attributing this success to the strong prior knowledge and intuition built up over their ML careers.