Scale AI publishes its first LLM Leaderboards, ranking AI model performance in specific domains

Scale AI Inc. has published its first-ever SEAL Leaderboards, a new ranking system for large language models (LLMs) based on private, curated datasets. The leaderboards show that OpenAI’s GPT family of LLMs ranks first in three of the four initial domains, with Anthropic PBC’s Claude 3 Opus taking first place in the fourth category. Google LLC’s Gemini models also performed well, ranking joint-first with the GPT models in some domains. The rankings were developed by Scale AI’s Safety, Evaluations, and Alignment Lab and aim to maintain neutrality and integrity by not revealing the nature of the prompts used to evaluate LLMs.

The company also highlighted some challenges faced by AI researchers, including the lack of high-quality evaluation datasets, inconsistent reporting of evaluations, unverified expertise of evaluators, and inadequate tooling to understand evaluation results. Scale AI plans to update its rankings multiple times a year and add new frontier models as they become available. It also intends to add new domains to the leaderboards, aiming to become the most trusted third-party evaluator of LLMs.

Key takeaways

Scale AI has published its first-ever SEAL Leaderboards, a new ranking system for large language models (LLMs) based on private, curated, and unexploitable datasets.
The leaderboards show that OpenAI’s GPT family of LLMs ranks first in three of the four initial domains, with Anthropic PBC’s Claude 3 Opus taking first place in the fourth category.
Scale AI developed the SEAL Leaderboards to address the lack of transparency around AI performance and to overcome challenges faced by AI researchers, such as the lack of high-quality evaluation datasets and inconsistent reporting of evaluations.
Scale AI plans to update its rankings multiple times a year, adding new frontier models as they become available and new domains over time, in its bid to become the most trusted third-party evaluator of LLMs.

Scale AI publishes its first LLM Leaderboards, ranking AI model performance in specific domains - SiliconANGLE
