Feature Story
These researchers used NPR Sunday Puzzle questions to benchmark AI 'reasoning' models | TechCrunch
Feb 06, 2025 · techcrunch.com
The current best-performing model on this benchmark is o1, with a score of 59%, followed by o3-mini at 47% and DeepSeek's R1 at 35%. The researchers plan to extend their testing to more reasoning models to identify where those models can be improved. They emphasize the value of reasoning benchmarks that don't require specialized knowledge, since a wider range of people can then understand and analyze the results. Because state-of-the-art models are increasingly used in everyday settings, they argue that this kind of accessible benchmark could lead to better AI solutions in the future.
Key takeaways
- Researchers created an AI benchmark using riddles from NPR's Sunday Puzzle to test AI problem-solving abilities.
- The benchmark reveals that reasoning models like OpenAI's o1 sometimes give answers they know are incorrect.
- The Sunday Puzzle benchmark is designed to test general knowledge and reasoning, avoiding reliance on rote memory.
- The current best-performing model on the benchmark is o1, with a score of 59%, followed by o3-mini at 47% and DeepSeek's R1 at 35%.