Berkeley Function Calling Leaderboard

Apr 02, 2024 - gorilla.cs.berkeley.edu
The article introduces the Berkeley Function-Calling Leaderboard (BFCL), a comprehensive evaluation of how well Large Language Models (LLMs) call functions and tools. The BFCL was designed to represent most users' function-calling use cases, such as agents and enterprise workflows. It evaluates function calls in various forms and languages, and it also executes the generated calls to assess the models. The leaderboard reports cost and latency for every model as well. The evaluation is split into two categories, Python and Non-Python, each with its own subcategories, and it relies on two metrics: Abstract Syntax Tree (AST) evaluation and Executable Function Evaluation.
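To make the idea of AST evaluation concrete, here is a minimal sketch in Python. It is not the BFCL's actual checker: the helper names `parse_call` and `ast_match`, the `allowed_values` format, and the choice to accept keyword arguments only are all assumptions made for illustration. The sketch parses a model's output as a Python call expression and compares the function name and arguments against an expected answer without running anything.

```python
import ast


def parse_call(source: str):
    """Parse one Python call expression into (function name, keyword arguments)."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        # Assumption for this sketch: only keyword arguments are supported.
        raise ValueError("positional arguments are not handled here")
    name = ast.unparse(node.func)  # requires Python 3.9+
    # literal_eval only accepts literals, so nothing in the output is executed.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def ast_match(model_output: str, expected_name: str, allowed_values: dict) -> bool:
    """Return True if the call names the expected function and every
    argument takes one of the values listed for it (hypothetical format)."""
    try:
        name, kwargs = parse_call(model_output)
    except (SyntaxError, ValueError):
        return False
    if name != expected_name or set(kwargs) != set(allowed_values):
        return False
    return all(kwargs[key] in allowed_values[key] for key in kwargs)
```

A call such as `ast_match("get_weather(city='Berkeley', unit='celsius')", "get_weather", {"city": ["Berkeley"], "unit": ["celsius", "C"]})` passes under these assumptions, while output that is not a parseable call, or that is missing an argument, fails. Executable evaluation, by contrast, would run the generated call and inspect its result.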

The article also highlights the importance of cost and latency when choosing a model to integrate. For models served by an API provider, latency is measured by timing each request to the endpoint, and cost is derived from a formula given in the article. For locally hosted models, latency is the total inference time divided by the number of entries in the evaluation dataset, and cost is estimated with a separate method. The article concludes by explaining when to use function calling and when to rely on prompting, along with insights into the function-calling features that different models support.
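The two latency measurements described above can be sketched as follows. This is an illustrative outline only, not the leaderboard's code: `call_endpoint` and `generate` are hypothetical callables standing in for a hosted API request and a local model, and the cost formulas are left out because the summary does not state them.

```python
import time
from typing import Callable, List, Sequence


def per_request_latencies(call_endpoint: Callable[[str], str],
                          prompts: Sequence[str]) -> List[float]:
    """Hosted models: time each request to the endpoint individually."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        call_endpoint(prompt)  # hypothetical API call
        latencies.append(time.perf_counter() - start)
    return latencies


def mean_local_latency(generate: Callable[[str], str],
                       prompts: Sequence[str]) -> float:
    """Local models: total run time divided by the number of dataset entries."""
    if not prompts:
        raise ValueError("the evaluation dataset is empty")
    start = time.perf_counter()
    for prompt in prompts:
        generate(prompt)  # hypothetical local inference call
    total_seconds = time.perf_counter() - start
    return total_seconds / len(prompts)
```

The per-request version yields one number per entry, so it can also report percentiles, whereas the local version yields a single average over the whole run.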

Key takeaways:

  • The Berkeley Function-Calling Leaderboard (BFCL) is the first comprehensive evaluation of Large Language Models' (LLMs) ability to call functions and tools, covering various forms and languages.
  • The BFCL includes 2,000 question-function-answer pairs spanning multiple languages and diverse application domains, and it also investigates function relevance detection.
  • The evaluation process for the leaderboard uses Abstract Syntax Tree (AST) Evaluation and Executable Function Evaluation, and also considers cost and latency for all the different models.
  • The evaluation covers both function-calling and non-function-calling models, with the aim of understanding the performance of different models across popular API call scenarios.