Patronus AI tested four large language models on a subset of 150 questions from its FinanceBench dataset, which comprises more than 10,000 questions and answers drawn from the SEC filings of major publicly traded companies. GPT-4-Turbo failed to answer 88% of the questions when given no access to any SEC source document, but improved significantly once the filings were provided. Even in "Oracle" mode, however, where the model was pointed to the exact text containing the answer, it still produced an incorrect answer 15% of the time. Other models, such as Meta's Llama 2 and Anthropic's Claude 2, also struggled with accuracy, with Llama 2 producing incorrect answers 70% of the time.
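The three evaluation conditions described above (no document, the full filing, and the exact evidence passage) can be sketched as a minimal grading harness. This is an illustrative assumption, not Patronus AI's actual code: the dataset field names, the `model` callable, and the exact-string grading rule are all hypothetical simplifications (real benchmark answers are typically graded more flexibly than a string match).

```python
# Hypothetical sketch of a FinanceBench-style evaluation loop.
# Field names ("filing_text", "evidence_span", "gold_answer") and the
# model(question, context) callable are assumptions for illustration only.

def accuracy(graded):
    """Fraction of graded answers marked correct (0.0 for an empty set)."""
    return sum(graded) / len(graded) if graded else 0.0

def evaluate(model, questions, mode="closed_book"):
    """Grade a model over a question set under one retrieval condition.

    mode: "closed_book"  -> question only, no source document
          "long_context" -> question plus (nearly) the entire filing
          "oracle"       -> question plus the exact evidence passage
    """
    graded = []
    for q in questions:
        if mode == "closed_book":
            context = None
        elif mode == "long_context":
            context = q["filing_text"]
        else:  # "oracle"
            context = q["evidence_span"]
        answer = model(q["question"], context)
        # Simplified grading: exact match against the reference answer.
        graded.append(answer.strip().lower() == q["gold_answer"].strip().lower())
    return accuracy(graded)
```

Running the same model through all three modes and comparing the resulting accuracies is what produces the kind of gap the study reports (e.g. near-total failure closed-book versus ~79-85% with source material).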
Key takeaways:
- Chatbots relying on large language models often fail to accurately answer questions derived from SEC filings, according to researchers from Patronus AI.
- Patronus AI created a test called FinanceBench with over 10,000 questions and answers from SEC filings of major publicly traded companies to evaluate the performance of these AI models.
- OpenAI's GPT-4-Turbo, even when given nearly an entire filing to read, answered only 79% of the questions correctly, while Meta's Llama 2 produced incorrect answers 70% of the time.
- Anthropic's Claude 2 performed well when given "long context," answering 75% of the questions correctly, and GPT-4-Turbo improved significantly when given access to the underlying filings.