Study suggests that even the best AI models hallucinate a bunch | TechCrunch

Aug 15, 2024 - news.bensbites.com
A recent study by researchers from Cornell, the universities of Washington and Waterloo, and the nonprofit research institute AI2 found that all generative AI models, including Google’s Gemini, Anthropic’s Claude, and OpenAI’s GPT-4o, "hallucinate," or generate false information. The study, which fact-checked the models’ outputs against authoritative sources on a range of topics, found that no model performed exceptionally well across all of them. The models that hallucinated the least did so partly because they refused to answer questions they would otherwise get wrong. Even so, the best models produced hallucination-free text only about 35% of the time.

The researchers evaluated more than a dozen popular models, many of them released in the past year. The results suggest that models aren’t hallucinating much less these days, despite claims to the contrary from OpenAI, Anthropic, and other big generative AI players. Model size also didn’t matter much: smaller models hallucinated roughly as often as larger ones. The researchers suggest that an interim fix could be programming models to refuse to answer more often, but they also recommend that vendors invest more in hallucination-reducing research and involve human experts to verify and validate the information generative AI models produce.

Key takeaways:

  • All generative AI models, including Google’s Gemini, Anthropic’s Claude, and OpenAI’s GPT-4o, are prone to "hallucinations," generating mistruths whose rate and type depend on the information the models have been exposed to.
  • A study from researchers at Cornell, the universities of Washington and Waterloo, and AI2 found that no model performed exceptionally well across all topics, and models that hallucinated the least did so partly by refusing to answer questions they might get wrong.
  • Despite claims from AI companies, models are not hallucinating much less these days; OpenAI’s models were the least hallucinatory overall, followed by Mixtral 8x22B, Command R, and Perplexity’s Sonar models.
  • While eliminating hallucinations entirely may not be possible, they can be mitigated through human-in-the-loop fact-checking and citation during a model’s development; the researchers also argue that regulations should ensure human experts are always involved in verifying and validating the information generative AI models produce.