Arthur unveils Bench, an open-source AI model evaluator

Aug 17, 2023 - venturebeat.com
New York-based AI startup Arthur has launched Arthur Bench, an open-source tool for evaluating and comparing the performance of large language models (LLMs) such as OpenAI's GPT-3.5 Turbo and Meta's LLaMA 2. The tool lets companies test different language models on their specific use cases, providing metrics to compare models on accuracy, readability, hedging, and other criteria. Arthur Bench also translates academic measures into real-world business impact, combining statistical measures and scores with grading by other LLMs to evaluate candidate models' responses side by side.
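As a rough illustration of that side-by-side approach, the sketch below scores two hypothetical candidate models against reference answers using a simple token-overlap metric as a stand-in for the "statistical measures" half of the workflow. The function and model names here are illustrative assumptions for this article, not the arthur-bench API.

```python
# Hypothetical sketch of side-by-side LLM comparison in the spirit of Arthur Bench.
# Names (token_overlap, score_candidates, model_a, model_b) are illustrative only.

from collections import Counter


def token_overlap(candidate: str, reference: str) -> float:
    """Simple statistical score: fraction of reference tokens found in the candidate."""
    cand_tokens = Counter(candidate.lower().split())
    ref_tokens = Counter(reference.lower().split())
    if not ref_tokens:
        return 0.0
    matched = sum(min(cand_tokens[tok], count) for tok, count in ref_tokens.items())
    return matched / sum(ref_tokens.values())


def score_candidates(references, model_outputs):
    """Average each model's score across all test cases so models can be ranked side by side."""
    results = {}
    for model_name, outputs in model_outputs.items():
        scores = [token_overlap(out, ref) for out, ref in zip(outputs, references)]
        results[model_name] = sum(scores) / len(scores)
    return results


if __name__ == "__main__":
    references = ["The warranty covers parts and labor for three years."]
    model_outputs = {
        "model_a": ["Parts and labor are covered for three years under warranty."],
        "model_b": ["Please see the manual."],
    }
    print(score_candidates(references, model_outputs))  # model_a scores higher
```

A real evaluation suite would add the second half described above, asking another LLM to grade responses on criteria such as hedging or readability, and aggregate both kinds of scores per use case.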

Arthur Bench has already been used by financial-services firms to generate investment theses and analyses more quickly. Vehicle manufacturers have used the tool to build LLMs that answer customer queries quickly and accurately by sourcing information from equipment manuals. Arthur is open-sourcing Bench so anyone can use and contribute to it for free. The startup also announced a hackathon with Amazon Web Services (AWS) and Cohere to encourage developers to build new metrics for Arthur Bench.

Key takeaways:

  • AI startup Arthur has launched Arthur Bench, an open-source tool for evaluating and comparing the performance of large language models (LLMs).
  • Arthur Bench provides metrics to compare models on accuracy, readability, hedging, and other criteria, and allows companies to test the performance of different language models on their specific use cases.
  • Financial-services firms, vehicle manufacturers, and the enterprise media and publishing platform Axios HQ are among those already using Arthur Bench.
  • Arthur is also hosting a hackathon with Amazon Web Services (AWS) and Cohere to encourage developers to build new metrics for Arthur Bench.