The authors note that while LLMs are revolutionizing how software is written, their ability to write COBOL is far less well understood. They found that state-of-the-art LLMs struggle to generate COBOL that compiles: GPT-4, the best-performing off-the-shelf model, produced a correct solution for only 10.27% of problems. The fine-tuned _mAInframer-1_ models showed a significant improvement, with the 34b model achieving a higher pass rate than GPT-4. The authors hope that COBOLEval will help improve LLM-generated COBOL and help maintain the world's supply of critical COBOL code.
Key takeaways:
- The article discusses the potential of Large Language Models (LLMs) for writing COBOL, a legacy language that still powers critical systems, including 95% of US ATM transactions.
- COBOLEval, the first evaluation benchmark for LLM code completions in COBOL, is introduced. It consists of 146 coding problems converted into COBOL from the HumanEval Python generation benchmark (see the sketch after this list).
- Various models, including GPT-4 and CodeLlama, were evaluated on COBOLEval. All of them struggled to generate COBOL that compiles, indicating substantial room for improvement.
- The article also introduces _mAInframer-1_, a series of models fine-tuned to write COBOL, which outperformed the other models on the COBOLEval benchmark.
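
The exact COBOLEval problem format is not reproduced in this summary, but a HumanEval task converted to COBOL plausibly takes the shape of a small subprogram, since COBOL programs have no return values and must pass results back through parameters. The sketch below is a hypothetical illustration of that shape, assuming inputs and the output slot are passed via the `LINKAGE SECTION`; the program name and data names are invented for this example.

```cobol
      *> Hypothetical sketch of a HumanEval-style problem in COBOL:
      *> a subprogram that receives two integers and writes their
      *> sum into a result parameter supplied by the caller.
       IDENTIFICATION DIVISION.
       PROGRAM-ID. ADD-TWO.
       DATA DIVISION.
       LINKAGE SECTION.
       01 LK-A       PIC S9(9) COMP.
       01 LK-B       PIC S9(9) COMP.
       01 LK-RESULT  PIC S9(9) COMP.
       PROCEDURE DIVISION USING LK-A, LK-B, LK-RESULT.
      *> In a benchmark setting, the model would be asked to
      *> complete the procedure body below.
           COMPUTE LK-RESULT = LK-A + LK-B.
           GOBACK.
```

A test harness could then `CALL` the generated subprogram with known inputs and compare `LK-RESULT` against the expected output, mirroring how HumanEval checks Python completions with assertions.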