The company also highlighted several challenges facing AI researchers: a lack of high-quality evaluation datasets, inconsistent reporting of evaluations, unverified evaluator expertise, and inadequate tooling for understanding evaluation results. Scale AI plans to update its rankings multiple times a year and add new frontier models as they become available. It also intends to add new domains to the leaderboards, aiming to become the most trusted third-party evaluator of LLMs.
Key takeaways:
- Scale AI has published its first-ever SEAL Leaderboards, a new ranking system for large language models (LLMs) based on private, curated, and unexploitable datasets.
- The leaderboards show that OpenAI’s GPT family of LLMs ranks first in three of the four initial domains, with Anthropic PBC’s Claude 3 Opus taking first place in the fourth category.
- Scale AI developed the SEAL Leaderboards to address the lack of transparency around AI performance and to overcome challenges faced by AI researchers, such as the lack of high-quality evaluation datasets and inconsistent reporting of evaluations.
- Scale AI plans to update its rankings multiple times a year, adding new frontier models as they become available and new domains to the leaderboards, in its bid to become the most trusted third-party evaluator of LLMs.