However, the author points out that there are still significant gaps in the data, citing as an example the consistent mistranslation of the English word "you" into Indonesian by various AI models. The author suggests these models may have been trained on government documents or advertisements, leading to such errors. The author stresses the need for better benchmarks: current ones do not catch these issues, and code benchmarks are mostly focused on Python.
Key takeaways:
- OpenAI likely employs a large number of people to generate new data for GPT-4, including solving questions to be fed into the system.
- Data quality is crucial; feeding the system low-quality data, such as scraped private conversations, may not yield the best results.
- There are significant gaps in the data, as evidenced by consistent mistranslations of certain words in languages like Indonesian.
- Current benchmarks may not be sufficient to catch all bugs and issues, indicating a need for better benchmarks in the future.
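The kind of benchmark the author is calling for could be sketched as a register-aware translation check. Everything below is a hypothetical illustration: the `translate` function is a stub standing in for a real model call, and the test cases assume Indonesian's distinction between the formal pronoun "Anda" and the informal "kamu" for English "you".

```python
# Minimal sketch of a register-aware translation benchmark (hypothetical).

def translate(text: str, register: str) -> str:
    # Stub mimicking a model that always picks the formal pronoun,
    # regardless of the requested register -- the kind of consistent
    # error the author describes. A real benchmark would call a model.
    return text.replace("you", "Anda")

# Each case pairs an English source with the pronouns acceptable
# in the requested register ("Anda" is formal, "kamu" is informal).
CASES = [
    {"src": "How are you?", "register": "informal", "accept": ["kamu"]},
    {"src": "Can you help me?", "register": "formal", "accept": ["Anda"]},
]

def run_benchmark(cases):
    """Return the cases where no acceptable pronoun appears in the output."""
    failures = []
    for case in cases:
        out = translate(case["src"], case["register"])
        if not any(pronoun in out for pronoun in case["accept"]):
            failures.append((case["src"], case["register"], out))
    return failures

failures = run_benchmark(CASES)
```

Run against the stub, the informal case fails while the formal one passes, which is exactly the systematic gap a register-blind benchmark would never surface.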