The IAM database contains 13,353 images of handwritten text lines produced by 657 writers, who transcribed texts from the Lancaster-Oslo/Bergen (LOB) Corpus of British English. In total it comprises 1,539 handwritten pages and 115,320 words, and it is categorized as part of the modern collection. The database is labeled at the sentence, line, and word levels.
168 PAPERS • 2 BENCHMARKS
Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking.
142 PAPERS • 3 BENCHMARKS
ST-VQA aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process.
73 PAPERS • NO BENCHMARKS YET
Contains 145k captions for 28k images. The dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects.
64 PAPERS • 1 BENCHMARK
The ICDAR2003 dataset is a dataset for scene text recognition. It contains 507 natural scene images (including 258 training images and 249 test images) in total. The images are annotated at character level. Characters and words can be cropped from the images.
51 PAPERS • 1 BENCHMARK
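Because ICDAR2003 is annotated at the character level, individual characters and words can be cropped out as training samples. A minimal sketch of such cropping is below; the `(x, y, w, h)` box format is an assumption for illustration, and should be adapted to the actual ICDAR2003 XML ground truth.

```python
# Minimal sketch: crop annotated regions out of a scene image.
# The (x, y, w, h) box format is hypothetical -- adapt to the
# real ICDAR2003 ground-truth files.
import numpy as np

def crop_boxes(image, boxes):
    """Return one sub-image per (x, y, w, h) box, clipped to the image bounds."""
    h_img, w_img = image.shape[:2]
    crops = []
    for x, y, w, h in boxes:
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(w_img, x + w), min(h_img, y + h)
        crops.append(image[y0:y1, x0:x1])
    return crops

# Toy usage: a fake 100x200 grayscale "scene" with two word boxes.
image = np.zeros((100, 200), dtype=np.uint8)
crops = crop_boxes(image, [(10, 20, 50, 30), (120, 40, 60, 25)])
print([c.shape for c in crops])  # [(30, 50), (25, 60)]
```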
SciTSR is a large-scale table structure recognition dataset, which contains 15,000 tables in PDF format and their corresponding structure labels obtained from LaTeX source files.
32 PAPERS • NO BENCHMARKS YET
DocBank is a benchmark dataset containing 500K document pages with fine-grained token-level annotations for document layout analysis. It is constructed in a simple yet effective way, using weak supervision from the LaTeX documents available on arXiv.
28 PAPERS • NO BENCHMARKS YET
TextOCR is a benchmark for text recognition on arbitrarily shaped scene text in natural images. It provides ~1M high-quality word annotations on TextVQA images, enabling end-to-end reasoning on downstream tasks such as visual question answering and image captioning.
21 PAPERS • NO BENCHMARKS YET
This dataset includes 4,500 fully annotated images (over 30,000 license plate characters) from 150 vehicles in real-world scenarios where both the vehicle and the camera (inside another vehicle) are moving.
11 PAPERS • 1 BENCHMARK
A prebuilt dataset for OpenAI's task for an image-2-latex system. Includes a total of ~100k formulas and images split into train, validation, and test sets. Formulas were parsed from LaTeX sources provided at http://www.cs.cornell.edu/projects/kddcup/datasets.html (originally from arXiv).
9 PAPERS • 1 BENCHMARK
The Kannada-MNIST dataset is a drop-in substitute for the standard MNIST dataset for the Kannada language.
7 PAPERS • NO BENCHMARKS YET
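Since Kannada-MNIST is a drop-in substitute for MNIST, its files follow the same IDX binary layout: a big-endian header (magic number 2051 for image files, then image count, rows, and columns) followed by raw uint8 pixels. A minimal loader sketch, assuming IDX-format files:

```python
# Minimal sketch of an IDX image-file parser, as used by MNIST and
# drop-in substitutes such as Kannada-MNIST.
import struct
import numpy as np

def load_idx_images(data: bytes) -> np.ndarray:
    """Parse an IDX image buffer into an (n, rows, cols) uint8 array."""
    magic, n, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError("not an IDX image file")
    pixels = np.frombuffer(data, dtype=np.uint8, offset=16)
    return pixels.reshape(n, rows, cols)

# Toy usage: synthesize a 2-image 28x28 IDX buffer and parse it back.
header = struct.pack(">IIII", 2051, 2, 28, 28)
buf = header + bytes(2 * 28 * 28)
images = load_idx_images(buf)
print(images.shape)  # (2, 28, 28)
```

In practice the same function works on the decompressed contents of the distributed `*-idx3-ubyte` files.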
IIIT-AR-13K is created by manually annotating the bounding boxes of graphical or page objects in publicly available annual reports. This dataset contains a total of 13k annotated page images with objects in five different popular categories - table, figure, natural image, logo, and signature. It is the largest manually annotated dataset for graphical object detection.
6 PAPERS • NO BENCHMARKS YET
MLe2e is a dataset for the evaluation of scene text end-to-end reading systems and all intermediate stages, such as text detection, script identification, and text recognition. The dataset contains a total of 711 scene images covering four different scripts (Latin, Chinese, Kannada, and Hangul).
This dataset aims at evaluating the License Plate Character Segmentation (LPCS) problem. The experimental results of the paper Benchmark for License Plate Character Segmentation were obtained using a dataset providing 101 on-track vehicles captured during the day. The video was recorded using a static camera in early 2015.
6 PAPERS • 1 BENCHMARK
Chinese Text in the Wild is a dataset of Chinese text containing about 1 million Chinese characters, drawn from 3,850 unique characters, annotated by experts in over 30,000 street view images. This is a challenging dataset with good diversity, containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc.
5 PAPERS • NO BENCHMARKS YET
Contains video clips shot with modern high-resolution mobile cameras under strong projective distortions and low lighting conditions.
The RodoSol-ALPR dataset contains 20,000 images captured by static cameras located at pay tolls owned by the Rodovia do Sol (RodoSol) concessionaire, which operates 67.5 kilometers of a highway (ES-060) in the Brazilian state of Espírito Santo.
The ChineseLP dataset contains 411 vehicle images (mostly of passenger cars) with Chinese license plates (LPs). It consists of 252 images captured by the authors and 159 images downloaded from the internet. The images present great variations in resolution (from 143 × 107 to 2048 × 1536 pixels), illumination and background.
4 PAPERS • 1 BENCHMARK
Twitter100k is a large-scale dataset for weakly supervised cross-media retrieval.
4 PAPERS • NO BENCHMARKS YET
This dataset contains Bangla handwritten numerals, basic characters, and compound characters. It was collected from multiple geographical locations within Bangladesh and includes samples from a variety of age groups. The dataset can also be used for other classification problems, e.g., gender, age, or district.
3 PAPERS • 2 BENCHMARKS
The DDI-100 dataset is a synthetic dataset for text detection and recognition, based on 7,000 real unique document pages and consisting of more than 100,000 augmented images. The ground truth comprises text and stamp masks, as well as bounding boxes for text and characters with relevant annotations.
3 PAPERS • NO BENCHMARKS YET
Arabic handwriting dataset.
3 PAPERS • 1 BENCHMARK
Imgur5K is a large-scale handwritten in-the-wild dataset, containing challenging real-world handwritten samples from nearly 5K writers. It consists of ~135K handwritten English words from 5K different images. As opposed to existing datasets for OCR, which have limited variability in their images, the images in Imgur5K contain a diverse set of styles.
The Newspaper Navigator dataset is the largest dataset of visual content extracted from historic newspapers ever produced, accompanied by a finetuned visual content recognition model.
MCSCSet is a large-scale specialist-annotated dataset designed for the task of medical-domain Chinese spelling correction, containing about 200k samples. MCSCSet involves: i) extensive real-world medical queries collected from Tencent Yidian, and ii) corresponding misspelled sentences manually annotated by medical specialists.
2 PAPERS • NO BENCHMARKS YET
Covers 5 domains: synthetic, document, street view, handwritten, and car license, with over five million images.
2 PAPERS • 2 BENCHMARKS
This dataset contains 2,000 images taken from inside a warehouse of the Energy Company of Paraná (Copel), which directly serves more than 4 million consuming units in the Brazilian state of Paraná.
2 PAPERS • 1 BENCHMARK
Data collection: The first step in building a database is finding a suitable source of data. Here, the main goal is to collect images of Kurdish handwritten characters written by many writers, so a form was designed for this purpose (shown in Figure 1). Each form has one alphabet letter printed in the top right corner and 125 empty blocks, and the writers were asked to write the letter three times in three empty blocks. The total number of writers is 390. The forms were distributed among two main categories: the academic staff of the Information Technology department at Tishk International University, and the university students of the University of Kurdistan-Hawler, Salahaddin University, and Tishk International University, as shown in Table 2. In total there were ten sets of forms, each set with 35 forms for 35 different letters; at first, we decided that nine sets
1 PAPER • 1 BENCHMARK
This dataset contains 12,500 meter images acquired in the field by the employees of the Energy Company of Paraná (Copel), which directly serves more than 4 million consuming units, across 395 cities and 1,113 locations (i.e., districts, villages and settlements), located in the Brazilian state of Paraná.
Doc3DShade extends Doc3D with realistic lighting and shading. It follows a similar synthetic rendering procedure using captured document 3D shapes, but the final image-generation step combines real shading of different types of paper materials under numerous illumination conditions.
1 PAPER • NO BENCHMARKS YET
Optical images of printed circuit boards as well as detailed annotations of any text, logos, and surface-mount devices (SMDs). There are several hundred samples spanning a wide variety of manufacturing locations, sizes, node technology, applications, and more.
Introduced by Singh, Sumeet S. "Teaching Machines to Code: Neural Markup Generation with Visual Attention." arXiv abs/1802.05415 (2018).
It is composed of around 770k color 256x256 RGB images extracted from the European Union Intellectual Property Office (EUIPO) open registry. Each image is associated with multiple labels that classify the figurative and textual elements appearing in it. These annotations were assigned by EUIPO evaluators using the Vienna classification, a hierarchical classification of figurative marks.
MatriVasha is the largest dataset of handwritten Bangla compound characters for research on handwritten Bangla compound character recognition. The dataset contains 120 different types of compound characters across 306,464 images, of which 152,950 were written by males and 153,514 by females. It can also be used for other tasks such as gender-, age-, and district-based handwriting research, because the samples were collected to cover district authenticity, a range of age groups, and an equal number of men and women.
This paper introduces a new large-scale dataset for Farsi document images, named SUT, which aims to tackle the challenges associated with obtaining diverse and substantial ground-truth data for supervised models in document image analysis (DIA) tasks, like document image classification, text detection and recognition, and information retrieval. The dataset comprises 62,453 images that have been categorized into 21 distinct classes, including identity documents featuring synthetically generated personal information superimposed on various backgrounds. The dataset also includes corresponding files with labeling information for the images. The ground-truth data is organized in CSV files containing image file paths and associated information about the embedded data.
1 PAPER • 2 BENCHMARKS
The UTRSet-Real dataset is a comprehensive, manually annotated dataset specifically curated for Printed Urdu OCR research. It contains over 11,000 printed text line images, each of which has been meticulously annotated. One of the standout features of this dataset is its remarkable diversity, which includes variations in fonts, text sizes, colours, orientations, lighting conditions, noises, styles, and backgrounds. This diversity closely mirrors real-world scenarios, making the dataset highly suitable for training and evaluating models that aim to excel in real-world Urdu text recognition tasks.
The UTRSet-Synth dataset is introduced as a complementary training resource to the UTRSet-Real Dataset, specifically designed to enhance the effectiveness of Urdu OCR models. It is a high-quality synthetic dataset comprising 20,000 lines that closely resemble real-world representations of Urdu text.
WebLI (Web Language Image) is a web-scale multilingual image-text dataset, designed to support Google’s vision-language research, such as the large-scale pre-training for image understanding, image captioning, visual question answering, object detection etc.
Description: 105,941 images of natural-scene OCR data covering 12 languages (6 Asian, 6 European), multiple natural scenes, and multiple photographic angles. For annotation, line-level quadrilateral bounding boxes and transcriptions of the texts are provided. The data can be used for tasks such as multi-language OCR.
0 PAPERS • NO BENCHMARKS YET
This dataset is an extremely challenging set of over 5,000 original Hindi text images captured and crowdsourced from over 700 urban and rural areas, where each image is manually reviewed and verified by computer vision professionals at Datacluster Labs.
The dataset consists of Indian traffic sign images for classification and detection. The images were taken in varied weather conditions, in daylight, evening, and at night, and exhibit a wide variety of variations in illumination, distance, viewpoint, etc. This dataset represents a very challenging set of unstructured images of Indian traffic signboards.
This dataset is an extremely challenging set of over 2,000 original visiting card/ID card images captured and crowdsourced from over 300 urban and rural areas, where each image is manually reviewed and verified by computer vision professionals at Datacluster Labs.