doc2dict was created to support the development of another open-source Python package for working with Securities and Exchange Commission (SEC) data. Because it provides a generalized, customizable document parser, doc2dict removes the need for numerous specialized parsers for different document types. Converting documents into dictionary form also significantly reduces their size, which opens up possibilities for NoSQL database experiments. The creator is also working on making the conversion process more modular by introducing "mapping dicts" that users can adjust to their specific needs.
Key takeaways:
- doc2dict is a Python package that converts HTML and PDF documents into dictionaries while preserving their hierarchy; for HTML files it also supports table extraction.
- The package processes HTML at 500 pages per second and PDF at 200 pages per second; multithreading is limited by PDFium.
- It uses a simplified representation of documents as lists of dictionaries and converts them to a hierarchical dictionary using predetermined rules, with plans for modular "mapping dicts" for customization.
- The creator aims to make SEC data easier to work with and to reduce document size significantly, with potential applications in NoSQL database experiments.
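
To make the list-of-dictionaries idea more concrete, here is a minimal sketch of how a flat list of elements might be folded into a hierarchical dictionary using a rule table. It is not doc2dict's actual API or schema: the `LEVELS` mapping, the `build_hierarchy` function, and the `type`/`text` keys are all hypothetical names chosen for illustration, and the real package's rules may work quite differently.

```python
# Hypothetical "mapping dict": element type -> depth in the hierarchy.
# Illustrative only; this is not doc2dict's real schema.
LEVELS = {"part": 0, "item": 1}


def build_hierarchy(elements, levels=LEVELS):
    """Fold a flat list of {"type", "text"} dicts into a nested dict."""
    root = {"title": "root", "text": [], "children": []}
    stack = [(-1, root)]  # (depth, node) pairs for the currently open sections

    for el in elements:
        depth = levels.get(el["type"])
        if depth is None:
            # Not a heading: attach the text to the innermost open section.
            stack[-1][1]["text"].append(el["text"])
            continue
        # Close every open section at the same depth or deeper.
        while stack[-1][0] >= depth:
            stack.pop()
        node = {"title": el["text"], "text": [], "children": []}
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))

    return root


elements = [
    {"type": "part", "text": "Part I"},
    {"type": "item", "text": "Item 1"},
    {"type": "text", "text": "Business overview."},
    {"type": "item", "text": "Item 2"},
    {"type": "part", "text": "Part II"},
]
tree = build_hierarchy(elements)
```

In this sketch, changing the hierarchy rules means editing `LEVELS` rather than the parsing code, which is one plausible reading of what an adjustable mapping dict could offer.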