PubLayNet is a dataset for document layout analysis, built by automatically matching the XML representations with the content of over one million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets: it contains over 360 thousand document images in which typical document layout elements are annotated.
100 PAPERS • 1 BENCHMARK
The RVL-CDIP dataset consists of scanned document images belonging to 16 classes such as letter, form, email, resume, and memo. The dataset has 320,000 training, 40,000 validation, and 40,000 test images. The images are characterized by low quality, noise, and low resolution, typically 100 dpi.
96 PAPERS • 3 BENCHMARKS
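Splits of this kind are commonly distributed as plain-text label files. Below is a minimal parsing sketch, assuming the widely used `relative/image/path.tif <label>` line format with integer class indices 0–15; the line format and the example paths are assumptions for illustration, not taken from the dataset description above.

```python
# Minimal sketch: parsing RVL-CDIP-style label files.
# Assumption: each line is "relative/image/path.tif <label>", where <label>
# is an integer in 0..15 indexing one of the 16 document classes.

def parse_labels(text):
    """Return a list of (image_path, class_index) pairs from a label file's text."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        path, label = line.rsplit(maxsplit=1)
        pairs.append((path, int(label)))
    return pairs

# Example with made-up paths:
sample = "imagesa/a/a/doc1.tif 0\nimagesb/b/b/doc2.tif 15\n"
print(parse_labels(sample))  # [('imagesa/a/a/doc1.tif', 0), ('imagesb/b/b/doc2.tif', 15)]
```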
DocBank is a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis. It is constructed in a simple yet effective way, using weak supervision from the LaTeX source of documents available on arXiv.org.
28 PAPERS • NO BENCHMARKS YET
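The weak-supervision idea is to render each semantic structure in the LaTeX source in a distinct color, then read token labels back from the rendered page. A minimal sketch of that label-recovery step follows; the RGB values and the category set here are hypothetical placeholders for illustration, not DocBank's actual palette.

```python
# Sketch of color-based label recovery: tokens rendered in a structure-specific
# color are mapped back to a layout category. The palette below is a
# hypothetical placeholder, not DocBank's actual color scheme.

PALETTE = {
    (255, 0, 0): "title",      # hypothetical: titles rendered red
    (0, 255, 0): "paragraph",  # hypothetical: body text rendered green
    (0, 0, 255): "caption",    # hypothetical: captions rendered blue
}

def label_token(rgb, palette=PALETTE):
    """Assign the category whose palette color is nearest (squared distance)."""
    nearest = min(
        palette,
        key=lambda c: sum((x - y) ** 2 for x, y in zip(c, rgb)),
    )
    return palette[nearest]

# Rendering artifacts shift colors slightly; nearest-color lookup absorbs that:
print(label_token((250, 10, 5)))  # 'title'
```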
The database consists of 150 annotated pages from three different medieval manuscripts with challenging layouts. Furthermore, we provide layout analysis ground truth that has been iterated on, reviewed, and refined by an expert in medieval studies.
14 PAPERS • 2 BENCHMARKS
The DSSE-200 is a complex document layout dataset covering a variety of document styles. It contains 200 images drawn from pictures, PPT slides, brochures, old newspapers, and scanned documents.
8 PAPERS • NO BENCHMARKS YET
HJDataset is a large dataset of historical Japanese documents with complex layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders of layout elements. The dataset is constructed using a combination of human and machine efforts.
5 PAPERS • NO BENCHMARKS YET
We present the VIS30K dataset, a collection of 29,689 images that represents 30 years of figures and tables from each track of the IEEE Visualization conference series (Vis, SciVis, InfoVis, VAST). VIS30K’s comprehensive coverage of the scientific literature in visualization not only reflects the progress of the field but also enables researchers to study the evolution of the state-of-the-art and to find relevant work based on graphical content. We describe the dataset and our semi-automatic collection process, which couples convolutional neural networks (CNN) with curation. Extracting figures and tables semi-automatically allows us to verify that no images are overlooked or extracted erroneously. To improve quality further, we engaged in a peer-search process for high-quality figures from early IEEE Visualization papers.
Revision: v1.0.0-full-20210527a
DOI: 10.5281/zenodo.4817662
Authors: J. Chazalon, E. Carlinet, Y. Chen, J. Perret, C. Mallet, B. Duménieu and T. Géraud
Official competition website: https://icdar21-mapseg.github.io/
1 PAPER • NO BENCHMARKS YET
The UrduDoc Dataset is a benchmark for Urdu text line detection in scanned documents, created as a byproduct of the UTRSet-Real dataset generation process. Comprising 478 diverse images collected from sources such as books, documents, manuscripts, and newspapers, it offers a valuable resource for research in Urdu document analysis. The dataset includes 358 pages for training and 120 pages for validation, featuring a wide range of styles, scales, and lighting conditions. It serves as a benchmark for evaluating printed Urdu text detection models, and benchmark results for state-of-the-art models are provided; the Contour-Net model achieves the best performance in terms of h-mean.
1 PAPER • 1 BENCHMARK
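In ICDAR-style text detection evaluation, the h-mean reported above is conventionally the harmonic mean of detection precision and recall (i.e. the F1 score). A one-function sketch:

```python
# Harmonic mean (h-mean) of detection precision and recall, the standard
# summary metric in ICDAR-style text detection evaluation.

def h_mean(precision, recall):
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)

print(h_mean(0.8, 0.6))  # 0.6857142857142857
```

Because it is a harmonic mean, a model cannot score well by maximizing only one of the two quantities: h-mean is pulled toward the lower of precision and recall.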