
On the Evaluation of Large Language Models in Unit Test Generation

Dec 30, 2024 - arxiv.org
The article explores the potential of open-source Large Language Models (LLMs) for automating unit test generation in software development, a task traditionally seen as challenging and time-consuming. While previous research has focused on closed-source LLMs such as ChatGPT and Codex, this study investigates the capabilities of open-source LLMs, which offer advantages such as data privacy protection and have shown superior performance on certain tasks. The research is based on 17 Java projects and evaluates five widely used open-source LLMs with different architectures and parameter sizes, using comprehensive evaluation metrics.

The findings reveal the significant impact of various prompt factors on the performance of LLMs in generating unit tests. The study compares open-source LLMs with commercial models like GPT-4 and traditional tools like EvoSuite, identifying limitations in LLM-based unit test generation. The authors derive a series of implications to guide future research and practical applications of LLMs in this area, emphasizing that effective prompting strategies are essential for maximizing the capabilities of LLMs.
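To make the role of prompt factors concrete, here is a minimal, hypothetical sketch of how a unit-test-generation prompt might combine a focal method with context and an instruction. The function name, prompt wording, and structure below are illustrative assumptions, not the actual prompt templates evaluated in the paper.

```python
# Hypothetical sketch of a prompt template for LLM-based unit test
# generation. The specific prompt factors studied in the paper are not
# reproduced here; this only illustrates the general idea of combining
# code context with a task instruction.

def build_test_prompt(class_name: str, method_source: str) -> str:
    """Assemble a unit-test-generation prompt from a focal Java method."""
    return (
        f"You are a Java developer. Write a JUnit test for the following "
        f"method from class {class_name}.\n\n"
        f"```java\n{method_source}\n```\n\n"
        "Return only compilable Java test code."
    )

prompt = build_test_prompt(
    "Calculator",
    "public int add(int a, int b) { return a + b; }",
)
```

Varying factors such as how much surrounding class context is included, or how the output format is constrained, is the kind of prompt design choice the study's findings suggest can substantially affect generation quality.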

Key takeaways:

  • Open-source LLMs offer advantages in data privacy protection and have demonstrated superior performance on some tasks compared to closed-source LLMs.
  • Effective prompting is crucial for maximizing the capabilities of LLMs in unit test generation.
  • The study evaluates five widely used open-source LLMs across 17 Java projects, comparing their performance to commercial GPT-4 and traditional EvoSuite.
  • Findings highlight significant influences of prompt factors and identify limitations in LLM-based unit test generation, providing implications for future research and practical use.
