The research also tested ChatGPT-4, which improved on ChatGPT-3.5 in code generation and repair but showed similar weaknesses in self-verification. Both versions exhibited self-contradictory hallucinations: they initially deemed code correct or secure, then contradicted that verdict during self-checks. Hu emphasizes the importance of pairing ChatGPT's capabilities with human expertise to ensure the quality and reliability of generated code, arguing that ChatGPT should be treated as a supportive tool rather than a replacement for human developers and testers.
Key takeaways:
- ChatGPT-3.5 has a moderate success rate in generating correct code, identifying vulnerabilities, and repairing code, but still makes significant errors.
- Guiding questions improve ChatGPT's ability to detect incorrect code, vulnerabilities, and failed repairs compared to direct prompts.
- ChatGPT exhibits self-contradictory hallucinations, highlighting the need for cautious evaluation of its outputs.
- While ChatGPT-4 shows improvements over ChatGPT-3.5, both versions still require integration with human expertise to ensure code quality and reliability.
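The difference between direct prompts and guiding questions can be sketched as two prompt templates. This is a minimal illustration of the idea, not the paper's actual prompt wording; the function names and phrasing are hypothetical.

```python
# Hypothetical sketch of the two self-verification prompt styles discussed
# above. The exact wording used in the study is not reproduced here.

def direct_prompt(code: str) -> str:
    """Ask the model for a blanket verdict on its own output."""
    return f"Is the following code correct?\n\n{code}"


def guiding_question_prompt(code: str, concern: str) -> str:
    """Point the model at a specific failure mode to check for,
    which the study found improves detection over a direct prompt."""
    return (
        f"The following code may contain {concern}. "
        f"Examine it line by line and report any instance you find.\n\n{code}"
    )


snippet = "def div(a, b):\n    return a / b"

print(direct_prompt(snippet))
print(guiding_question_prompt(snippet, "a division-by-zero vulnerability"))
```

In practice, the guiding question names the suspected defect class (incorrect logic, a vulnerability, a failed repair), which steers the model toward a targeted check instead of a generic, and often self-contradictory, yes/no judgment.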