The leaderboard shows the performance of various models, with GPT 4 03/14 (Legacy) currently in first place with a score of 93. Other models such as GPT 4, GPT 3.5 Turbo Instruct, and GPT 3.5 Turbo follow closely. Scores range from a high of 93 down to a low of 7, indicating wide variation in performance across models.
Key takeaways:
- The leaderboard is an experiment to address the drawbacks of traditional LLM benchmarks, which quickly leak into training datasets and are hard to relate to real-world use cases.
- The dataset used here is dynamic, changes every week, and is composed of crowdsourced real-world prompts.
- GPT-4 is used to grade each model's response against a set of rubrics (a sketch of such a judge call follows this list), with more details available on the about page.
- All the data is stored in a Postgres database, and the page shows the raw results, with GPT 4 03/14 (Legacy) currently leading the leaderboard; a sketch of the kind of aggregation query behind the rankings appears below.
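
The rubric-grading step can be pictured as one judge call per response. The sketch below is a hypothetical illustration, assuming the `openai` Python SDK; the rubric text, the `grade_response` helper, and the exact prompt format are placeholders for illustration, not the project's actual code.

```python
# Minimal sketch of rubric-based grading with GPT-4 as the judge.
# RUBRIC and grade_response are illustrative names, not the project's real code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """Score the answer from 0-10 on each criterion:
1. Factual accuracy
2. Instruction following
3. Clarity"""

def grade_response(prompt: str, answer: str) -> str:
    """Ask GPT-4 to grade a model's answer against the rubric."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a strict grader.\n" + RUBRIC},
            {"role": "user", "content": f"Prompt:\n{prompt}\n\nAnswer:\n{answer}"},
        ],
        temperature=0,  # deterministic grading
    )
    return completion.choices[0].message.content
```

With temperature pinned to 0, repeated grading of the same answer should stay stable, which matters when scores feed a public ranking.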
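
Since the raw results live in Postgres, computing the leaderboard reduces to an aggregate-and-sort query. Here is a minimal sketch, assuming `psycopg2` and a hypothetical `results` table with `model` and `score` columns; the schema and connection string are assumptions, not the project's actual setup.

```python
# Hypothetical sketch of reading the raw results back from Postgres.
# The table and column names (results, model, score) are assumed for illustration.
import psycopg2

conn = psycopg2.connect("dbname=leaderboard user=postgres")
with conn, conn.cursor() as cur:
    # Average each model's rubric scores and rank them, highest first.
    cur.execute("""
        SELECT model, AVG(score) AS avg_score
        FROM results
        GROUP BY model
        ORDER BY avg_score DESC;
    """)
    for model, avg_score in cur.fetchall():
        print(f"{model}: {avg_score:.1f}")
conn.close()
```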