Stanford University's Center for Research on Foundation Models has developed a benchmarking system called the Holistic Evaluation of Language Models (HELM) to compare the performance of large language models. It evaluates models on aspects including accuracy, efficiency, fairness, and bias across a range of scenarios. Companies can also commission evaluations tailored to their specific needs through services such as Arthur AI's Bench. The article emphasizes the importance of transparency from the creators of language models when evaluating their performance.
Key takeaways:
- Financial institutions are exploring large language models such as OpenAI's ChatGPT, Anthropic's Claude, and Cohere's Command; each company offers proprietary models that can be fine-tuned for specific purposes.
- While these models offer potential benefits, using them requires institutions to trust the providers with their data. Some providers, such as Amazon, offer protections like encryption in transit and at rest.
- There are also free and open-source models available for use, such as Cerebras' family of language models and Databricks' Dolly. These models can be downloaded and used without sharing data with an AI company.
- Stanford University's Center for Research on Foundation Models has developed the Holistic Evaluation of Language Models (HELM) to benchmark the performance of various language models, testing aspects such as accuracy, efficiency, fairness, and bias.