Chinese AI models storm Hugging Face's LLM chatbot benchmark leaderboard — Alibaba runs the board as major US…

Jun 29, 2024 - tomshardware.com
Hugging Face has launched its second large language model (LLM) leaderboard, aiming to provide a more challenging standard for testing LLM performance across various tasks. Alibaba's Qwen models have taken the lead in the inaugural rankings, securing three spots in the top ten. The new leaderboard tests language models across four tasks: knowledge testing, reasoning on extremely long contexts, complex math abilities, and instruction following. The benchmarks used for testing include solving 1,000-word murder mysteries and explaining PhD-level questions in layman's terms.

The new leaderboard's frontrunner is Qwen, Alibaba's LLM, whose variants place 1st, 3rd, and 10th. Other notable entries include Meta's Llama3-70B and several smaller open-source projects. Hugging Face's leaderboard does not test closed-source models such as ChatGPT, to ensure results remain reproducible. The company encourages anyone to submit new models for testing and admission to the leaderboard, with a voting system prioritizing popular new entries. However, the article notes that true artificial "intelligence" is still many years away, as LLM performance is only as good as its training data.
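Because the leaderboard only admits open models, any entry can be pulled straight from the Hugging Face Hub and tried locally. Below is a minimal sketch using the transformers library; the checkpoint ID "Qwen/Qwen2-7B-Instruct" is an assumption about which Qwen variant is meant, and any Hub model works the same way:

```python
# Minimal sketch: download an open leaderboard model from the Hugging Face
# Hub and run a quick local generation. The model ID is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"  # assumed variant; swap in any Hub model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit consumer GPUs
    device_map="auto",           # shard layers across available devices
)

messages = [{"role": "user", "content": "Explain chain-of-thought prompting in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

The `device_map="auto"` setting lets the model spread across whatever GPUs (or CPU memory) are available, which matters at the 7B-plus parameter scales the leaderboard covers.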

Key takeaways:

  • Hugging Face has released its second LLM leaderboard to rank the best language models, with Alibaba's Qwen models dominating the inaugural rankings.
  • The new leaderboard tests language models across four tasks: knowledge testing, reasoning on extremely long contexts, complex math abilities, and instruction following.
  • Tests are run exclusively on Hugging Face's own computers, powered by 300 Nvidia H100 GPUs, and anyone is free to submit new models for testing and admission to the leaderboard (a local approximation of such a run is sketched after this list).
  • Some LLMs, including newer variants of Meta's Llama, underperformed on the new leaderboard, a result attributed to a trend of over-training LLMs on the first leaderboard's benchmarks, which caused their real-world performance to regress.
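The leaderboard's scores are produced with EleutherAI's lm-evaluation-harness, so a comparable run can be approximated locally. A rough sketch, assuming harness version 0.4+; the task name and model settings here are illustrative choices, not the leaderboard's exact configuration:

```python
# Sketch of a local instruction-following evaluation with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). Task and model choices are
# illustrative assumptions, not the leaderboard's official recipe.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # the Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2-7B-Instruct,dtype=bfloat16",
    tasks=["ifeval"],   # instruction following, one of the four task areas
    batch_size="auto",  # let the harness pick a batch size that fits memory
)
print(results["results"])  # per-task metrics keyed by task name
```

Official leaderboard scores are still computed only on Hugging Face's hardware; a local run like this serves as a sanity check before submitting a model.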