The research also tested ChatGPT-4, which improved on ChatGPT-3.5 in code generation and repair but showed similar weaknesses in self-verification. Both versions exhibited self-contradictory hallucinations: they initially deemed code correct or secure, then contradicted that verdict during self-checks. Hu emphasizes the importance of pairing ChatGPT's capabilities with human expertise to ensure the quality and reliability of generated code, arguing that ChatGPT should be treated as a supportive tool rather than a replacement for human developers and testers.
Key takeaways:
- ChatGPT-3.5 has a moderate success rate in generating correct code, identifying vulnerabilities, and repairing code, but still makes significant errors.
- Guiding questions improve ChatGPT's ability to detect incorrect code, vulnerabilities, and failed repairs compared to direct prompts.
- ChatGPT exhibits self-contradictory hallucinations, highlighting the need for cautious evaluation of its outputs.
- While ChatGPT-4 shows improvements over ChatGPT-3.5, both versions still require integration with human expertise to ensure code quality and reliability.
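The difference between direct prompts and guiding questions can be sketched as two prompt templates. This is a minimal illustration of the idea, not the paper's actual prompt wording; the function names and phrasing are hypothetical.

```python
# Hypothetical sketch of the two self-verification prompt styles discussed
# above. The exact wording used in the study is not reproduced here.

def direct_prompt(code: str) -> str:
    """Ask the model for a blanket verdict on its own output."""
    return f"Is the following code correct?\n\n{code}"


def guiding_question_prompt(code: str, concern: str) -> str:
    """Point the model at a specific failure mode to check for,
    which the study found improves detection over a direct prompt."""
    return (
        f"The following code may contain {concern}. "
        f"Examine it line by line and report any instance you find.\n\n{code}"
    )


snippet = "def div(a, b):\n    return a / b"

print(direct_prompt(snippet))
print(guiding_question_prompt(snippet, "a division-by-zero vulnerability"))
```

In practice, the guiding question names the suspected defect class (incorrect logic, a vulnerability, a failed repair), which steers the model toward a targeted check instead of a generic, and often self-contradictory, yes/no judgment.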