The studies have not been peer-reviewed and did not test the latest releases of the Gemini models, but they add to concerns that Google has been overpromising the capabilities of its AI models. Other models tested, including OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, also performed poorly. The researchers suggest that better benchmarks and more third-party critique are needed to counter hyped-up claims about generative AI.
Key takeaways:
- Google's flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, have been found to struggle with processing and analyzing very long inputs, such as book-length documents, contrary to the company's claims about their long context windows.
- Two separate studies found that the models often failed to answer questions about long documents correctly; in one series of document-based tests, the models gave the right answer only 40% to 50% of the time (a simplified sketch of this kind of test appears after this list).
- The models had particular difficulty verifying claims that required reasoning over large portions of a book, or the book as a whole, and with claims about implicit information that is never stated explicitly in the text.
- Despite these findings, Google continues to advertise the models' large context windows as a key selling point, leading to accusations that the company is overpromising and under-delivering with Gemini.
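For a concrete sense of what such a test looks like, here is a minimal sketch of a long-context claim-verification probe, assuming the google-generativeai Python SDK. The claims, labels, and file path are hypothetical stand-ins for the researchers' actual benchmarks, not their data.

```python
# Minimal sketch of a long-context claim-verification test: feed a model an
# entire book, then ask it to label claims about the text as true or false.
# Assumes the google-generativeai Python SDK; claims and labels below are
# hypothetical placeholders, not the studies' actual benchmark items.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

book_text = open("book.txt", encoding="utf-8").read()  # book-length input

# Hypothetical claims, some requiring reasoning over the whole book.
claims = [
    ("The narrator never reveals her real name.", "true"),
    ("The story is told in strict chronological order.", "false"),
]

correct = 0
for claim, label in claims:
    prompt = (
        f"{book_text}\n\n"
        "Based only on the text above, is the following claim true or "
        f"false? Answer with one word.\nClaim: {claim}"
    )
    answer = model.generate_content(prompt).text.strip().lower()
    correct += answer.startswith(label)

print(f"Accuracy: {correct / len(claims):.0%}")  # studies reported ~40-50%
```

In this kind of setup, claims whose evidence is scattered across the book, or only implied by it, are exactly where the studies found the models faltering.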