The ICDAR 2013 dataset consists of 229 training images and 233 testing images, with word-level annotations provided. It is the standard benchmark dataset for evaluating near-horizontal text detection.
229 PAPERS • 3 BENCHMARKS
Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking.
142 PAPERS • 3 BENCHMARKS
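Since FUNSD supports entity labeling and linking, a small parsing sketch can make the annotation layout concrete. The snippet below assumes the standard FUNSD JSON structure (a `"form"` list whose entries carry `text`, `label`, `box`, and `linking` fields); the sample record is illustrative, not taken from the dataset.

```python
import json

# Hypothetical FUNSD-style annotation; real files are one JSON per scanned form.
sample = json.dumps({
    "form": [
        {"id": 0, "text": "Date:", "label": "question",
         "box": [100, 40, 160, 60], "linking": [[0, 1]]},
        {"id": 1, "text": "03/18/1994", "label": "answer",
         "box": [170, 40, 260, 60], "linking": [[0, 1]]},
    ]
})

def extract_entities(annotation_json):
    """Return (id, label, text) triples plus question->answer link pairs."""
    form = json.loads(annotation_json)["form"]
    entities = [(e["id"], e["label"], e["text"]) for e in form]
    links = {tuple(pair) for e in form for pair in e["linking"]}
    return entities, sorted(links)

entities, links = extract_entities(sample)
print(entities)  # [(0, 'question', 'Date:'), (1, 'answer', '03/18/1994')]
print(links)     # [(0, 1)]
```

Linking pairs appear on both endpoints in FUNSD annotations, so collecting them into a set deduplicates each relation.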
SciTSR is a large-scale table structure recognition dataset, which contains 15,000 tables in PDF format and their corresponding structure labels obtained from LaTeX source files.
32 PAPERS • NO BENCHMARKS YET
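SciTSR's structure labels describe each cell by its row/column span, from which the logical table grid can be rebuilt. The sketch below assumes cell records with `start_row`/`end_row`/`start_col`/`end_col` and a `content` field, in the spirit of the dataset's structure-label JSON; the field names and sample cells are assumptions for illustration.

```python
# Hypothetical SciTSR-style cell records (a 2x2 table with no spans).
cells = [
    {"content": "Model", "start_row": 0, "end_row": 0, "start_col": 0, "end_col": 0},
    {"content": "F1",    "start_row": 0, "end_row": 0, "start_col": 1, "end_col": 1},
    {"content": "Ours",  "start_row": 1, "end_row": 1, "start_col": 0, "end_col": 0},
    {"content": "95.3",  "start_row": 1, "end_row": 1, "start_col": 1, "end_col": 1},
]

def cells_to_grid(cells):
    """Expand cell span records into a dense 2-D grid of cell contents."""
    n_rows = max(c["end_row"] for c in cells) + 1
    n_cols = max(c["end_col"] for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        # A spanned cell repeats its content over every grid slot it covers.
        for r in range(c["start_row"], c["end_row"] + 1):
            for k in range(c["start_col"], c["end_col"] + 1):
                grid[r][k] = c["content"]
    return grid

grid = cells_to_grid(cells)
print(grid)  # [['Model', 'F1'], ['Ours', '95.3']]
```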
The goal of PubTables-1M is to create a large, detailed, high-quality dataset for training and evaluating a wide variety of models for the tasks of table detection, table structure recognition, and functional analysis. It contains nearly one million fully annotated tables drawn from scientific articles.
14 PAPERS • NO BENCHMARKS YET
To address the need for a standard open-domain table benchmark dataset, the authors propose a novel weak-supervision approach to automatically create TableBank, which is orders of magnitude larger than existing human-labeled datasets for table analysis. Unlike traditional weakly supervised training sets, this approach yields training data that is not only large-scale but also high-quality.

13 PAPERS • NO BENCHMARKS YET
IIIT-AR-13K is created by manually annotating the bounding boxes of graphical or page objects in publicly available annual reports. The dataset contains a total of 13k annotated page images with objects in five popular categories: table, figure, natural image, logo, and signature. It is the largest manually annotated dataset for graphical object detection.
6 PAPERS • NO BENCHMARKS YET
Tables are a compact and efficient way of summarizing and presenting correlated information in handwritten and printed archival documents, scientific journals, reports, financial statements, and so on. Table recognition is fundamental to extracting information from structured documents. The ICDAR 2019 cTDaR competition evaluates two aspects of table analysis: table detection and recognition. Participating methods are evaluated both on a modern dataset and on archival documents containing printed and handwritten tables.
2 PAPERS • 1 BENCHMARK
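Table detection benchmarks such as cTDaR typically score predicted table regions against ground truth by intersection-over-union (IoU) at one or more thresholds. A minimal sketch of that matching step, assuming axis-aligned `(x1, y1, x2, y2)` boxes and a greedy one-to-one assignment (the exact protocol and thresholds are competition-specific):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match_detections(preds, gts, threshold=0.6):
    """Greedily match predictions to ground truth; return the true-positive count."""
    matched = set()
    tp = 0
    for p in preds:
        best, best_iou = None, threshold
        for i, g in enumerate(gts):
            if i in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            matched.add(best)
            tp += 1
    return tp

print(match_detections([(0, 0, 10, 10)], [(1, 1, 10, 10)]))  # 1
```

Precision and recall then follow from the true-positive count divided by the number of predictions and ground-truth tables, respectively.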
We present TNCR, a new table dataset with varying image quality collected from free open-source websites. The TNCR dataset can be used for table detection in scanned document images and for classifying detected tables into five different classes.
2 PAPERS • NO BENCHMARKS YET
STDW is a diverse, large-scale dataset for table detection, with more than seven thousand samples covering a wide variety of table structures collected from many different sources.
1 PAPER • 1 BENCHMARK