GitHub - Dicklesworthstone/llama2_aided_tesseract

The project aims to improve Optical Character Recognition (OCR) output by using a Large Language Model (LLM), specifically Llama2, to correct errors and enhance text readability. The process converts scanned PDFs into images, applies OCR to each image, and then passes the text through the Llama2 model for error correction and formatting. The project also includes a function that filters potential hallucinations out of the corrected text using sentence embeddings and cosine similarity.
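The hallucination filter described above can be illustrated with a minimal sketch. It is not the repository's code: the function names, the 0.3 threshold, and the sentence splitter are all assumptions, and a toy bag-of-words vector stands in for a real sentence-embedding model so the example runs with the standard library alone. The idea is that a corrected sentence is kept only if it is close enough, by cosine similarity, to some sentence in the raw OCR text.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence-embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def filter_hallucinations(raw_text, corrected_text, threshold=0.3):
    """Keep corrected sentences that resemble at least one raw OCR sentence.

    The threshold is an assumed, illustrative value, not one taken from the project.
    """
    raw_vectors = [embed(s) for s in split_sentences(raw_text)]
    kept = []
    for sentence in split_sentences(corrected_text):
        vector = embed(sentence)
        best = max((cosine_similarity(vector, r) for r in raw_vectors), default=0.0)
        if best >= threshold:
            kept.append(sentence)
    return " ".join(kept)
```

With a real embedding model, OCR misspellings would still map close to their corrected forms; the bag-of-words stand-in only tolerates them when most other words in the sentence match, so a working threshold would need tuning for whichever embedding is actually used.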

The repository's markdown documentation provides detailed setup instructions, including installing the necessary software and cloning the GitHub repository. It also covers usage: setting various variables and running the program. The program creates three output files: the raw OCR output, the LLM-corrected output, and the LLM-corrected output with hallucinations filtered out. The documentation also explains the functions used in the program and the hallucination-filtering process.
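As a rough sketch of how the three output files relate to the pipeline stages, the example below wires the steps together. Everything named here is hypothetical: `run_pipeline`, the output file names, and the three callables (`ocr_page`, `correct_text`, `filter_text`) are placeholders for Tesseract, the Llama2 model, and the embedding-based filter, passed in by the caller so the sketch stays runnable without those dependencies.

```python
from pathlib import Path


def run_pipeline(page_images, ocr_page, correct_text, filter_text, output_dir, base_name):
    """Run OCR -> LLM correction -> hallucination filtering and write three files.

    ocr_page, correct_text and filter_text are caller-supplied callables
    (hypothetical stand-ins for the project's actual steps).
    """
    # Stage 1: OCR every page image and join the per-page text.
    raw_text = "\n".join(ocr_page(image) for image in page_images)
    # Stage 2: let the LLM correct errors and formatting.
    corrected_text = correct_text(raw_text)
    # Stage 3: drop corrected content that has no support in the raw OCR text.
    filtered_text = filter_text(raw_text, corrected_text)

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # File names are illustrative assumptions, not the project's naming scheme.
    outputs = {
        "raw_ocr": (out / f"{base_name}_raw_ocr.txt", raw_text),
        "llm_corrected": (out / f"{base_name}_llm_corrected.txt", corrected_text),
        "llm_corrected_filtered": (out / f"{base_name}_llm_corrected_filtered.txt", filtered_text),
    }
    for path, text in outputs.values():
        path.write_text(text, encoding="utf-8")
    return {name: path for name, (path, _) in outputs.items()}
```

Keeping all three files makes it possible to compare each stage's output and see whether a given change came from OCR, from the LLM, or from the filter.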

Key takeaways:

The project uses Llama2 to improve the accuracy of Tesseract OCR output when converting scanned PDFs into readable text files.
The process involves converting a PDF into images, applying OCR to each image, and then passing the OCR'ed text through the Llama2 13B Chat model to correct errors and enhance formatting.
The program offers options to verify if the OCR output is valid English and to reformat the text using markdown. It also has a function to filter potential hallucinations from the LLM corrected text using sentence embeddings and cosine similarity.
The project is open to contributions and can be useful for challenging files where regular OCR, without any "smart" corrections, produces many errors.
