LiveBench is designed to counter test data contamination by releasing new questions every month, sourced from recently released datasets, math competitions, arXiv papers, news articles, and IMDb movie synopses. The benchmark has already been used to evaluate many prominent closed and open-source models, and even the top performers score below 60% accuracy, a reflection of its difficulty. The creators believe LiveBench will make model comparisons easier and inform how other researchers design their evaluations in the future.
Key takeaways:
- A team of researchers from Nvidia, Abacus.ai, New York University, the University of Maryland, and the University of Southern California has developed a new benchmark, LiveBench, to address the limitations of existing industry-standard benchmarks.
- LiveBench offers contamination-free test data, drawing on frequently updated questions from recent sources and scoring answers automatically against objective ground-truth values (see the sketch after this list).
- The benchmark covers a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis.
- LiveBench releases new questions every month to minimize potential test data contamination; these queries are sourced from recently released datasets, math competitions, arXiv papers, news articles, and IMDb movie synopses.
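To make the ground-truth scoring idea concrete, here is a minimal sketch of what fully automatic grading against fixed reference answers can look like. The data schema, function names, and normalization rules below are illustrative assumptions, not LiveBench's actual implementation.

```python
# Illustrative sketch of objective, ground-truth scoring (not LiveBench's real code).
# Assumption: each question carries a single canonical answer string, and a model
# response counts as correct only if its normalized answer matches exactly.

def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace and a trailing period so
    trivial formatting differences don't affect the score."""
    return answer.strip().lower().rstrip(".")

def score_response(model_answer: str, ground_truth: str) -> int:
    """Return 1 for an exact match against the ground truth, else 0.
    No LLM or human judge is involved; the check is fully automatic."""
    return int(normalize(model_answer) == normalize(ground_truth))

def evaluate(questions: list[dict], answers: dict[str, str]) -> float:
    """Compute accuracy over a batch of questions.

    questions: [{"id": "q1", "ground_truth": "42"}, ...]  (hypothetical schema)
    answers:   {"q1": "42", ...}, keyed by question id
    """
    scores = [
        score_response(answers.get(q["id"], ""), q["ground_truth"])
        for q in questions
    ]
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    qs = [
        {"id": "q1", "ground_truth": "42"},
        {"id": "q2", "ground_truth": "paris"},
    ]
    ans = {"q1": "42.", "q2": "London"}
    print(f"accuracy = {evaluate(qs, ans):.2f}")  # accuracy = 0.50
```

A production benchmark would need more robust answer extraction (for example, parsing a final answer out of a longer response, or numeric tolerance for math tasks), but the key property is the same as sketched here: the score depends only on a fixed ground-truth value, so results are reproducible and avoid the biases of LLM-as-judge evaluation.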