The study conducts extensive experiments with eleven mainstream LLMs on MultiCodeBench, revealing how their code generation performance varies across domains. This gives developers practical guidance for selecting a suitable LLM for a specific application domain. Additionally, the article analyzes the reasons behind the models' failures on domain-specific development tasks, offering guidance for model developers on improving domain-specific code generation capabilities.
Key takeaways:
- Introduction of MultiCodeBench, a new benchmark for evaluating the code generation performance of LLMs in specific application domains.
- MultiCodeBench includes 2,400 programming tasks across 12 software development domains and 15 programming languages.
- Annotators rewrite task docstrings, and a static-analysis-based dependency parsing tool extracts the dependencies of each task's ground-truth code, ensuring task quality and enabling deeper performance analysis (see the sketch after this list).
- Experiments with eleven mainstream LLMs provide insights into their performance across domains and guidance for improving domain-specific code generation.
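To make the dependency-parsing idea concrete, here is a minimal sketch of what static dependency extraction can look like for a single Python snippet. It uses only the standard `ast` module to collect imported modules and called names; the function `extract_dependencies` and the toy snippet are illustrative assumptions, not the paper's actual multi-language tool.

```python
import ast

def extract_dependencies(source: str) -> dict:
    """Statically collect imported modules and called names from a Python snippet."""
    tree = ast.parse(source)
    imports, calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # e.g. "import torch" -> "torch"
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # e.g. "from torch import nn" -> "torch"
            if node.module:
                imports.add(node.module)
        elif isinstance(node, ast.Call):
            # Record the function or method name being invoked
            func = node.func
            if isinstance(func, ast.Name):
                calls.add(func.id)
            elif isinstance(func, ast.Attribute):
                calls.add(func.attr)
    return {"imports": sorted(imports), "calls": sorted(calls)}

# Hypothetical ground-truth implementation for one benchmark task
snippet = """
import torch
from torch import nn

def build_model(dim):
    layer = nn.Linear(dim, dim)
    return torch.jit.script(layer)
"""
print(extract_dependencies(snippet))
# {'imports': ['torch'], 'calls': ['Linear', 'script']}
```

In a benchmark setting, output like this can be used to check that a task's reference solution actually exercises domain-specific libraries, and to relate a model's failures to the particular dependencies a task requires.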