Show HN: Doc2dict, a fast, open-source document-to-dict converter

doc2dict is a Python package that converts HTML and PDF documents into dictionaries while preserving their hierarchical structure. It supports table extraction for HTML files and runs at about 500 pages per second for HTML and 200 pages per second for PDFs, provided the PDFs have an underlying text structure. The package cannot use multithreading because of limitations in PDFium. doc2dict first simplifies a document into a list of dictionaries, each representing a text block with features such as "bold" and "font-size", and then converts that list into a hierarchical dictionary using predetermined rules. The package also offers visualization tools to aid debugging.
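The two-stage idea described above (a flat list of text-block dicts, then a rule-based conversion into a nested dict) can be sketched in plain Python. This is only an illustration under stated assumptions, not doc2dict's actual API or rule set: the sample blocks, the output keys "title", "text" and "children", and the functions is_heading and build_hierarchy are all hypothetical, and the only rule used here is "a bold block is a heading, and a larger font size means a higher level".

```python
from typing import Any

# Hypothetical flat representation: one dict per text block, with features
# such as "bold" and "font-size" (sample data invented for this sketch).
blocks = [
    {"text": "Annual Report", "bold": True, "font-size": 20},
    {"text": "Item 1. Business", "bold": True, "font-size": 14},
    {"text": "We make widgets.", "bold": False, "font-size": 10},
    {"text": "Item 2. Risk Factors", "bold": True, "font-size": 14},
    {"text": "Widgets may break.", "bold": False, "font-size": 10},
]


def is_heading(block: dict[str, Any]) -> bool:
    # Assumed rule for this sketch: bold blocks are headings.
    return bool(block["bold"])


def build_hierarchy(blocks: list[dict[str, Any]]) -> dict[str, Any]:
    """Nest blocks under headings; larger font size means a higher level."""
    root: dict[str, Any] = {"title": None, "text": [], "children": []}
    # Stack of (font size, node); the root acts as an infinitely large heading.
    stack: list[tuple[float, dict[str, Any]]] = [(float("inf"), root)]
    for block in blocks:
        if is_heading(block):
            size = block["font-size"]
            # Close every open section whose heading is not larger than this one.
            while stack[-1][0] <= size:
                stack.pop()
            node = {"title": block["text"], "text": [], "children": []}
            stack[-1][1]["children"].append(node)
            stack.append((size, node))
        else:
            # Body text belongs to the most recently opened section.
            stack[-1][1]["text"].append(block["text"])
    return root


tree = build_hierarchy(blocks)
```

In this sketch the heading rule lives in one small function, so swapping in a different rule (for example, one keyed on font size alone) does not require touching the nesting logic.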

The motivation behind doc2dict is to support the development of another open-source Python package for working with Securities and Exchange Commission (SEC) data. A generalized document parser that can be customized removes the need for numerous specialized parsers for different document types. Converting documents into dictionary form also significantly reduces their size, opening up possibilities for NoSQL database experiments. The creator is also working on making the conversion process more modular by introducing "mapping dicts" that users can adjust for their specific needs.

Key takeaways:

doc2dict is a Python package that converts HTML and PDF documents into dictionaries while preserving hierarchy and supports table extraction for HTML files.
The package processes HTML at 500 pages per second and PDFs at 200 pages per second; it cannot use multithreading because of PDFium limitations.
It uses a simplified representation of documents as lists of dictionaries and converts them to a hierarchical dictionary using predetermined rules, with plans for modular "mapping dicts" for customization.
The creator aims to make SEC data easier to work with and to reduce document size significantly, with potential applications in NoSQL database experiments.
