These researchers used NPR Sunday Puzzle questions to benchmark AI 'reasoning' models

A team of researchers from several institutions, including Wellesley College and Northeastern University, has built an AI benchmark from riddles featured on NPR's Sunday Puzzle to test models' problem-solving abilities. The benchmark targets reasoning models such as OpenAI's o1 and DeepSeek's R1, which sometimes fail to give correct answers despite their advanced capabilities. The researchers chose the Sunday Puzzle for its accessibility: its riddles require only general knowledge and do not rely on rote memory, which makes them a distinctive challenge for AI models. The study finds that reasoning models, while thorough in fact-checking themselves, can exhibit human-like behaviors such as frustration and giving up, which hurts their performance.

The best-performing model on this benchmark so far is o1, with a score of 59%, followed by o3-mini at 47% and DeepSeek's R1 at 35%. The researchers plan to extend their testing to more reasoning models to identify areas for improvement. They stress the importance of reasoning benchmarks that do not require specialized knowledge, which allows broader access and understanding. This approach could lead to better AI solutions in the future, as state-of-the-art models are increasingly used in everyday settings.

Key takeaways

Researchers created an AI benchmark using riddles from NPR's Sunday Puzzle to test AI problem-solving abilities.
The benchmark reveals that reasoning models like OpenAI's o1 sometimes give answers they know are wrong.
The Sunday Puzzle benchmark is designed to test general knowledge and reasoning, avoiding reliance on rote memory.
The current best-performing model on the benchmark is o1, with a score of 59%, followed by o3-mini at 47% and DeepSeek's R1 at 35%.

These researchers used NPR Sunday Puzzle questions to benchmark AI 'reasoning' models | TechCrunch
