The project's markdown documentation gives detailed setup instructions, covering installation of the required software and cloning of the GitHub repository. It also describes usage: setting the configuration variables and running the program. The program writes three output files: the raw OCR output, the LLM-corrected output, and the LLM-corrected output with hallucinations filtered out. The documentation also explains the functions used in the program and how hallucination filtering works.
Key takeaways:
- The project uses Llama2 to improve the accuracy of Tesseract OCR output when converting scanned PDFs into readable text files.
- The process involves converting a PDF into images, applying OCR to each image, and then passing the OCR'ed text through the Llama2 13B Chat model to correct errors and enhance formatting.
- The program offers options to check whether the OCR output is valid English and to reformat the text as markdown. It also includes a function that filters potential hallucinations from the LLM-corrected text using sentence embeddings and cosine similarity.
- The project is open to contributions and can be useful for challenging files that produce many errors under regular OCR without any "smart" corrections.
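
The hallucination filtering mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the naive sentence splitter, and the `threshold` value are all hypothetical, and a simple bag-of-words vector stands in for a real sentence-embedding model so that the example runs with only the standard library. The idea is that a sentence in the LLM-corrected text is kept only if it is sufficiently similar, by cosine similarity, to at least one sentence in the raw OCR text.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(sentence):
    # Stand-in for a sentence-embedding model (assumption): a bag-of-words
    # count vector. A real pipeline would use learned dense embeddings.
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine_similarity(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_hallucinations(original_text, corrected_text, threshold=0.5):
    # Keep a corrected sentence only if its best match among the original
    # OCR sentences reaches the (hypothetical) similarity threshold.
    original_vectors = [embed(s) for s in split_sentences(original_text)]
    kept = []
    for sentence in split_sentences(corrected_text):
        vector = embed(sentence)
        best = max(
            (cosine_similarity(vector, o) for o in original_vectors),
            default=0.0,
        )
        if best >= threshold:
            kept.append(sentence)
    return " ".join(kept)


raw_ocr = "The quick brovvn fox jumps over the lazy dog. It was a sunny day."
corrected = (
    "The quick brown fox jumps over the lazy dog. It was a sunny day. "
    "The fox later won a Nobel Prize."
)
print(filter_hallucinations(raw_ocr, corrected))
```

In this toy input, the first two corrected sentences closely match the raw OCR text and are kept, while the third has no close counterpart and is dropped. The threshold controls the trade-off: set it too high and legitimate corrections are discarded, too low and invented sentences slip through.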