The article also highlights the role of synthetic data in training these models. Traditional synthetic data, typically anonymized tabular records, is inadequate for training LLMs. However, advances in the synthetic data sector now allow unstructured data, such as live chat logs or social media threads, to be expanded into sizable training datasets. This approach is both cost-effective and privacy-safe, meeting the criteria of "responsible AI". The article concludes by recommending a modern AI workflow: prototype in ChatGPT, create synthetic data, select a smaller LLM, train it on the synthetic data, test, and deploy.
Key takeaways:
- Generative AI applications like ChatGPT carry high computational costs, reportedly as much as $700,000 a day, driving a trend toward smaller, more specialized Large Language Models (LLMs) such as Meta's LLaMA.
- ChatGPT plays a critical role in prototyping AI initiatives, but running it in production may be prohibitively expensive, prompting the shift to smaller LLMs for specific use cases.
- Synthetic data, which has significantly advanced over the past decade, can be used to train these smaller, specialized models in a cost-effective and privacy-safe manner.
- A modern, cost-effective AI workflow involves prototyping in ChatGPT, curating a small dataset of real data to seed synthetic-data generation, selecting a smaller LLM as the production model, training that LLM on the synthetic data, testing and retesting it, and finally deploying the model.
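The synthetic-data step in the workflow above can be sketched in miniature. The snippet below is only an illustration: real synthetic-data tools use generative models to expand seed data, whereas this stand-in uses simple template substitution. The seed exchanges, product names, and helper functions (`make_synthetic_dataset`, `write_jsonl`) are all hypothetical; what the sketch does show accurately is the prompt/completion JSONL shape that most LLM fine-tuning pipelines expect as input.

```python
import json
import random

# Hypothetical seed set of anonymized support-chat exchanges.
# A real pipeline would start from actual (privacy-scrubbed) logs.
SEED_EXCHANGES = [
    ("How do I reset my {product} password?",
     "Open the {product} settings page and choose 'Reset password'."),
    ("My {product} order hasn't arrived yet.",
     "Sorry about that! Please share your {product} order number."),
]

# Placeholder product names used for template substitution.
PRODUCTS = ["Acme Cloud", "Acme Mobile", "Acme Desktop"]

def make_synthetic_dataset(n_examples, seed=0):
    """Expand the seed exchanges into n_examples prompt/completion records."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_examples):
        prompt_tpl, completion_tpl = rng.choice(SEED_EXCHANGES)
        product = rng.choice(PRODUCTS)
        records.append({
            "prompt": prompt_tpl.format(product=product),
            "completion": completion_tpl.format(product=product),
        })
    return records

def write_jsonl(records, path):
    """Write records as JSONL, the format fine-tuning jobs typically consume."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

dataset = make_synthetic_dataset(100)
```

In a real setting, the resulting JSONL file would then be fed to the fine-tuning step of whichever smaller LLM was selected, before the test-and-deploy stages of the workflow.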