To address these issues, the authors have released a paper proposing ways to improve the evaluation of AI agents. They recommend implementing cost-controlled evaluations, jointly optimizing accuracy and cost, distinguishing model benchmarking from downstream benchmarking, preventing shortcuts in agent benchmarks, and improving the standardization and reproducibility of agent benchmarks. They express cautious optimism about the future of AI agent research, citing the growing culture of sharing code and data and the reality checks provided by product failures.
Key takeaways:
- Developing and evaluating AI agents, systems that take real-world actions such as booking flight tickets or fixing software bugs, is a new and challenging field. Current benchmarks and evaluation practices often produce agents that perform well on tests but poorly in practical applications.
- The authors propose three clusters of properties that make an AI system more 'agentic': the complexity of the environment and goals, the user interface and level of supervision, and the system design.
- Despite the hype around AI agents, many product launches built on them have failed because of issues with speed and reliability. The authors argue that reliability must improve before AI agents can become successful products.
- The authors' paper makes five recommendations for improving the development and evaluation of AI agents: implement cost-controlled evaluations, jointly optimize accuracy and cost, distinguish model benchmarking from downstream benchmarking, prevent shortcuts in agent benchmarks, and standardize agent benchmarks so results are reproducible. The accuracy-cost trade-off is sketched below.
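As a rough illustration of what jointly evaluating accuracy and cost can look like, the sketch below computes a Pareto frontier over a set of agent results. This is a minimal example, not the paper's methodology; the agent names, accuracies, and per-task costs are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    name: str
    accuracy: float   # benchmark accuracy in [0, 1]
    cost_usd: float   # average inference cost per task, in USD

def pareto_frontier(results: list[AgentResult]) -> list[AgentResult]:
    """Keep agents that are not dominated: an agent is dominated if some
    other agent is at least as cheap and at least as accurate, and strictly
    better on one of the two."""
    frontier = []
    for r in results:
        dominated = any(
            other.cost_usd <= r.cost_usd
            and other.accuracy >= r.accuracy
            and (other.cost_usd < r.cost_usd or other.accuracy > r.accuracy)
            for other in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_usd)

# Hypothetical results, for illustration only.
results = [
    AgentResult("simple-baseline", accuracy=0.62, cost_usd=0.05),
    AgentResult("retry-5x",        accuracy=0.64, cost_usd=0.25),
    AgentResult("complex-agent",   accuracy=0.63, cost_usd=0.40),
]

for r in pareto_frontier(results):
    print(f"{r.name}: accuracy={r.accuracy:.2f}, cost=${r.cost_usd:.2f}")
```

Reporting the frontier rather than a single accuracy number makes it harder for an expensive agent to appear state of the art when a cheap baseline is nearly as accurate, which is the spirit of the cost-controlled evaluations the authors recommend.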