The authors note that while LLMs are revolutionizing how software is written, their ability to write COBOL is far less well understood. They found that state-of-the-art LLMs struggle to generate COBOL that compiles: GPT-4, the best-performing off-the-shelf model, produced a correct solution for only 10.27% of problems. The fine-tuned _mAInframer-1_ models showed a significant improvement, with the 34b model achieving a higher pass rate than GPT-4. The authors hope that COBOLEval will help improve LLM-generated COBOL and help maintain the world's supply of critical COBOL code.
Key takeaways:
- The article discusses the potential of Large Language Models (LLMs) for writing COBOL, a legacy language that still powers critical systems, including 95% of US ATM transactions.
- COBOLEval, the first evaluation benchmark for LLM code completions in COBOL, is introduced. It consists of 146 coding problems converted into COBOL from the HumanEval Python generation benchmark (see the sketch after this list).
- Various models, including GPT-4 and CodeLlama, were evaluated on COBOLEval. All of them struggled to generate COBOL that compiles, indicating substantial room for improvement.
- The article also introduces _mAInframer-1_, a series of models fine-tuned to write COBOL, which outperformed the other models on the COBOLEval benchmark.
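
The exact COBOLEval problem format is not reproduced in this summary, but a HumanEval task converted to COBOL plausibly takes the shape of a small subprogram, since COBOL programs have no return values and must pass results back through parameters. The sketch below is a hypothetical illustration of that shape, assuming inputs and the output slot are passed via the `LINKAGE SECTION`; the program name and data names are invented for this example.

```cobol
      *> Hypothetical sketch of a HumanEval-style problem in COBOL:
      *> a subprogram that receives two integers and writes their
      *> sum into a result parameter supplied by the caller.
       IDENTIFICATION DIVISION.
       PROGRAM-ID. ADD-TWO.
       DATA DIVISION.
       LINKAGE SECTION.
       01 LK-A       PIC S9(9) COMP.
       01 LK-B       PIC S9(9) COMP.
       01 LK-RESULT  PIC S9(9) COMP.
       PROCEDURE DIVISION USING LK-A, LK-B, LK-RESULT.
      *> In a benchmark setting, the model would be asked to
      *> complete the procedure body below.
           COMPUTE LK-RESULT = LK-A + LK-B.
           GOBACK.
```

A test harness could then `CALL` the generated subprogram with known inputs and compare `LK-RESULT` against the expected output, mirroring how HumanEval checks Python completions with assertions.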