The startup, Patronus AI, has developed FinanceBench, a benchmark of more than 10,000 questions and answers drawn from SEC filings, to test how well large language models perform in the financial sector. In tests, OpenAI's GPT-4-Turbo failed to answer 88% of the questions when it was not given access to any SEC source document, though it improved significantly when provided with the underlying filings. Other models tested, including Meta's Llama 2 and Anthropic's Claude 2, also struggled with accuracy. Despite these challenges, the founders of Patronus AI believe LLMs hold huge potential for the finance industry if the models continue to improve.
Key takeaways:
- Patronus AI, a startup founded by Anand Kannappan and Rebecca Qian, found that large language models, including OpenAI's GPT-4-Turbo, often fail to correctly answer questions derived from SEC filings.
- The company developed a benchmark called FinanceBench, consisting of over 10,000 questions and answers drawn from SEC filings, to evaluate the performance of AI models in the financial sector.
- When tested, the AI models often failed to answer or produced incorrect answers. For instance, GPT-4-Turbo failed to answer 88% of the 150 questions it was asked in a "closed book" test, and Llama 2, an AI model developed by Meta, produced wrong answers 70% of the time.
- Despite the current shortcomings, the co-founders of Patronus AI believe that there is huge potential for AI models to assist in the finance industry, provided they continue to improve.