Reducto Document Ingestion API

RD-TableBench is an open benchmark for evaluating extraction performance on complex tables. It covers challenging scenarios such as scanned tables, handwriting, language detection, merged cells, and more. Reducto created the benchmark, employing a team of PhD-level human labelers to manually annotate 1000 complex table images drawn from a diverse set of publicly available documents. The examples vary in structure, text density, and language.

The evaluation compared several tools and methods: Reducto, Azure Document Intelligence, AWS Textract Tables, GPT4o, Google Cloud Document AI, Unstructured, and Chunkr. Table comparison is treated as a hierarchical alignment problem, similar to DNA sequence alignment. The final similarity score is normalized between 0 and 1, with higher values indicating a closer match between the two tables. RD-TableBench aims to provide a more diverse set of real-world examples, relying on manual annotation to keep the labels accurate.

Key takeaways:

RD-TableBench is an open benchmark developed to evaluate extraction performance for complex tables, including scenarios like scanned tables, handwriting, language detection, and merged cells.
The RD-TableBench dataset comprises 1000 complex table images from a diverse set of publicly available documents, manually annotated by a team of PhD-level human labelers.
The evaluation methodology involved several tools/methods including Reducto, Azure Document Intelligence, AWS Textract Tables, GPT4o, Google Cloud Document AI, Unstructured, and Chunkr.
The benchmark uses a hierarchical alignment approach for table comparison, treating it as a problem similar to DNA sequence alignment, and uses the Needleman-Wunsch algorithm for measuring table similarity.
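
To make the hierarchical alignment idea concrete, here is a minimal sketch in Python. It is not RD-TableBench's actual scoring code: the gap score of 0, the use of `difflib.SequenceMatcher` as the cell-level string similarity, normalization by the longer sequence, and all function names are assumptions chosen for illustration. Cells are aligned within rows using Needleman-Wunsch, and rows are then aligned within tables using the same routine, with the row-level score serving as the match score.

```python
from difflib import SequenceMatcher

GAP = 0.0  # assumed gap score: an unmatched cell or row contributes nothing


def needleman_wunsch(a, b, sim, gap=GAP):
    """Return the best global alignment score between sequences a and b."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # pair the two items
                score[i - 1][j] + gap,  # leave a[i-1] unmatched
                score[i][j - 1] + gap,  # leave b[j-1] unmatched
            )
    return score[n][m]


def cell_similarity(x, y):
    """Similarity in [0, 1] between two cell strings (stand-in metric)."""
    return SequenceMatcher(None, x, y).ratio()


def row_similarity(r1, r2):
    """Align the cells of two rows and normalize by the longer row."""
    longest = max(len(r1), len(r2))
    if longest == 0:
        return 1.0
    return needleman_wunsch(r1, r2, cell_similarity) / longest


def table_similarity(t1, t2):
    """Align the rows of two tables and normalize by the taller table."""
    longest = max(len(t1), len(t2))
    if longest == 0:
        return 1.0
    return needleman_wunsch(t1, t2, row_similarity) / longest


truth = [["a", "b"], ["c", "d"]]
missing_row = [["a", "b"]]
print(table_similarity(truth, truth))
print(table_similarity(truth, missing_row))
```

Because every match score lies in [0, 1] and gaps score 0 in this sketch, the normalized result also stays in [0, 1]. A real scorer would likely penalize gaps and use a different cell metric, which changes the numbers but not the two-level structure.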
