The study conducts extensive experiments with eleven mainstream LLMs on MultiCodeBench, revealing how their code generation performance varies across domains. This gives developers practical guidance for selecting a suitable LLM for a specific application domain. Additionally, the article analyzes the reasons behind the models' failures on domain-specific development tasks, offering guidance for model developers on improving domain-specific code generation capabilities.
Key takeaways:
- Introduction of MultiCodeBench, a new benchmark for evaluating the code generation performance of LLMs in specific application domains.
- MultiCodeBench includes 2,400 programming tasks across 12 software development domains and 15 programming languages.
- Annotators rewrite task docstrings, and a static-analysis-based dependency parsing tool extracts the dependencies of each task's ground-truth code, ensuring task quality and enabling deeper performance analysis (see the sketch after this list).
- Experiments with eleven mainstream LLMs provide insights into their performance across domains and guidance for improving domain-specific code generation.
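To make the dependency-parsing idea concrete, here is a minimal sketch of what static dependency extraction can look like for a single Python snippet. It uses only the standard `ast` module to collect imported modules and called names; the function `extract_dependencies` and the toy snippet are illustrative assumptions, not the paper's actual multi-language tool.

```python
import ast

def extract_dependencies(source: str) -> dict:
    """Statically collect imported modules and called names from a Python snippet."""
    tree = ast.parse(source)
    imports, calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # e.g. "import torch" -> "torch"
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # e.g. "from torch import nn" -> "torch"
            if node.module:
                imports.add(node.module)
        elif isinstance(node, ast.Call):
            # Record the function or method name being invoked
            func = node.func
            if isinstance(func, ast.Name):
                calls.add(func.id)
            elif isinstance(func, ast.Attribute):
                calls.add(func.attr)
    return {"imports": sorted(imports), "calls": sorted(calls)}

# Hypothetical ground-truth implementation for one benchmark task
snippet = """
import torch
from torch import nn

def build_model(dim):
    layer = nn.Linear(dim, dim)
    return torch.jit.script(layer)
"""
print(extract_dependencies(snippet))
# {'imports': ['torch'], 'calls': ['Linear', 'script']}
```

In a benchmark setting, output like this can be used to check that a task's reference solution actually exercises domain-specific libraries, and to relate a model's failures to the particular dependencies a task requires.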