The article also highlights the importance of cost and latency when choosing a model to integrate. For models served by API providers, latency is measured by timing each request to the endpoint, while cost is derived from a pricing formula. For locally hosted models, latency is computed by dividing the total evaluation time by the number of entries in the evaluation dataset, and cost is estimated by a different method. The article concludes by explaining when to use function calling versus a plain prompt, offering insight into the function-calling features supported by different models.
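The two latency measurements described above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual harness: the `call_fn` callable stands in for whatever SDK or HTTP call a given provider requires.

```python
import time

def timed_request(call_fn, prompt):
    """Time a single request to a hosted model endpoint.

    `call_fn` is any callable that sends `prompt` to the provider's API
    and returns the response (a placeholder, not a real SDK call).
    """
    start = time.perf_counter()
    response = call_fn(prompt)
    latency = time.perf_counter() - start
    return response, latency

def mean_local_latency(total_time_seconds, num_entries):
    """For locally hosted models: average latency is the total wall-clock
    time of the evaluation run divided by the number of dataset entries."""
    return total_time_seconds / num_entries
```

Per-request timing captures network and queueing overhead for hosted endpoints, while the total-time average is the natural measure for a local run where requests may be batched.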
Key takeaways:
- The Berkeley Function-Calling Leaderboard (BFCL) is the first comprehensive evaluation of Large Language Models' (LLMs) ability to call functions and tools, covering a variety of function forms and programming languages.
- The BFCL includes 2k question-function-answer pairs in multiple languages and diverse application domains, and also investigates function relevance detection.
- The evaluation process for the leaderboard uses Abstract Syntax Tree (AST) Evaluation and Executable Function Evaluation, and also reports cost and latency for each model.
- The evaluation covers both function-calling and non-function-calling models, with the aim of understanding the performance of different models across popular API call scenarios.
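The AST-based evaluation mentioned above can be illustrated with a small sketch using Python's `ast` module. The matching rules here (exact function name, keyword-argument equality against a ground-truth dict) are a simplified stand-in for whatever the leaderboard actually implements, and the example function call is hypothetical.

```python
import ast

def matches_expected(model_output: str, expected_fn: str, expected_args: dict) -> bool:
    """Parse a model-generated call string (e.g. "get_weather(city='Paris')")
    and compare its function name and keyword arguments to ground truth.
    A simplified illustration of AST-style checking, not BFCL's actual logic."""
    try:
        tree = ast.parse(model_output, mode="eval")
    except SyntaxError:
        return False  # unparseable output fails the check
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return False  # output is not a simple function call
    if call.func.id != expected_fn:
        return False  # wrong function name
    try:
        got_args = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    except ValueError:
        return False  # non-literal argument values
    return got_args == expected_args
```

The appeal of AST matching over executing the call is that it needs no live API: structural equivalence with the reference answer can be checked offline, which is why the leaderboard pairs it with a separate executable evaluation.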