The issue extends to languages with a rich literary history, such as Tamil, where ChatGPT fails to produce coherent poetry. Despite improvements in GPT-4, the latest version, the model still struggles with complex tasks in languages like Bengali. Researchers argue that although the shortage of training data for these languages is a broader problem, OpenAI, as a for-profit company, should invest in correcting these disparities and improving ChatGPT's performance in underrepresented languages.
Key takeaways:
- ChatGPT, an AI chatbot, struggles to produce high-quality text in underrepresented languages such as Bengali, Swahili, Urdu, Thai, and Tigrinya, often generating fabricated words, illogical answers, and complete nonsense.
- AI language models like ChatGPT are trained largely on text scraped from the internet, where English and a few other major languages dominate, so their responses in low-resource languages are of lower quality.
- OpenAI, the creator of ChatGPT, does not include any language guidelines in its usage policy for the chatbot. It has said it is working toward better performance in languages other than English but has not shared specifics on these efforts.
- Researchers argue that OpenAI should invest in rectifying these data disparities and creating stronger user guardrails, especially as it profits from ChatGPT and markets it as capable of performing language translation tasks.