The new leaderboard's frontrunner is Qwen, Alibaba's LLM, whose variants rank 1st, 3rd, and 10th. Other notable entries include Meta's Llama3-70B and several smaller open-source projects. To keep results reproducible, Hugging Face's leaderboard does not test closed-source models such as ChatGPT. The company encourages anyone to submit new models for testing and inclusion on the leaderboard, with a voting system that prioritizes popular new entries. However, the article notes that true artificial "intelligence" is still many years away, as an LLM's performance is only as good as its training data.
Key takeaways:
- Hugging Face has released its second LLM leaderboard to rank the best language models, with Alibaba's Qwen models dominating the new leaderboard's first rankings.
- The new leaderboard tests language models across four tasks: knowledge testing, reasoning on extremely long contexts, complex math abilities, and instruction following.
- Tests are run exclusively on Hugging Face's own computers, powered by 300 Nvidia H100 GPUs, and anyone is free to submit new models for testing and inclusion on the leaderboard.
- Some LLMs, including newer variants of Meta's Llama, underperformed on the new leaderboard because of a trend of over-training LLMs on only the first leaderboard's benchmarks, which led to regressions in real-world performance.