The researchers evaluated more than a dozen popular models, many released within the past year. The results suggest that models aren’t hallucinating much less these days, despite claims to the contrary from OpenAI, Anthropic, and other big generative AI players. The study also found that model size didn’t matter much: smaller models hallucinated roughly as often as larger ones. The researchers suggest that programming models to refuse to answer more often could serve as an interim fix, but they also recommend that vendors invest more in hallucination-reducing research and involve human experts to verify and validate the information generative AI models produce.
Key takeaways:
- All generative AI models, including Google’s Gemini, Anthropic’s Claude, and OpenAI’s GPT-4o, are prone to ‘hallucinating’, or generating mistruths, with the rate and type of mistruths depending on the information they’ve been exposed to.
- A study from researchers at Cornell, the universities of Washington and Waterloo, and AI2 found that no model performed exceptionally well across all topics, and models that hallucinated the least did so partly by refusing to answer questions they might get wrong.
- Despite claims from AI companies, models are not hallucinating much less these days; OpenAI’s models were the least hallucinatory overall, followed by Mixtral 8x22B, Cohere’s Command R, and Perplexity’s Sonar models.
- While eliminating hallucinations entirely may not be possible, they can be mitigated through human-in-the-loop fact-checking and citation during a model’s development, and regulations should ensure that human experts are always involved in verifying and validating the information generative AI models produce, as illustrated in the sketch below.
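A minimal, hypothetical sketch of what the "refuse more often" interim measure combined with human review could look like in practice. The names (`ask_model`, `answer_with_abstention`, `ModelReply`), the self-reported confidence score, and the 0.75 threshold are all illustrative assumptions and are not drawn from the study or from any vendor's API.

```python
from dataclasses import dataclass


@dataclass
class ModelReply:
    answer: str
    confidence: float  # assumed self-reported confidence in [0, 1] (illustrative)


def ask_model(question: str) -> ModelReply:
    """Stand-in for a real model call; returns a canned reply for this demo."""
    return ModelReply(answer="Paris is the capital of France.", confidence=0.62)


REFUSAL_THRESHOLD = 0.75  # below this, refuse rather than risk a hallucination
review_queue: list[tuple[str, ModelReply]] = []  # items routed to human fact-checkers


def answer_with_abstention(question: str) -> str:
    reply = ask_model(question)
    if reply.confidence < REFUSAL_THRESHOLD:
        # Refuse to answer and route the question for human verification.
        review_queue.append((question, reply))
        return "I'm not confident enough to answer that reliably."
    return reply.answer


if __name__ == "__main__":
    print(answer_with_abstention("What is the capital of France?"))
    print(f"{len(review_queue)} item(s) queued for human review.")
```

In this setup, a low-confidence reply is withheld from the user and queued for a human fact-checker instead of being presented as fact, which reflects both mitigations the researchers describe: refusing to answer more often and keeping human experts in the loop.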