The ICDAR 2013 dataset consists of 229 training images and 233 testing images, with word-level annotations provided. It is the standard benchmark for evaluating near-horizontal text detection; a simplified detection-scoring sketch is given below.
229 PAPERS • 3 BENCHMARKS
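Detection on ICDAR 2013 is reported as precision, recall, and H-mean. The official DetEval protocol also credits one-to-many matches, so the following one-to-one IoU matching is only a simplified sketch; the (x_min, y_min, x_max, y_max) box format and the 0.5 threshold are illustrative assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned (x0, y0, x1, y1) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def score_detections(pred, gt, thresh=0.5):
    """Greedy one-to-one matching; returns (precision, recall, hmean)."""
    matched, tp = set(), 0
    for p in pred:
        best_j, best = None, thresh
        for j, g in enumerate(gt):
            if j not in matched and iou(p, g) >= best:
                best_j, best = j, iou(p, g)
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    hmean = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
    return precision, recall, hmean
```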
The COCO-Text dataset is a dataset for text detection and recognition. It is based on MS COCO, which contains images of complex everyday scenes, and it includes non-text images, legible-text images, and illegible-text images. In total there are 22,184 training images and 7,026 validation images with at least one instance of legible text; a sketch of filtering by legibility is given below.
80 PAPERS • 2 BENCHMARKS
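Since COCO-Text distinguishes legible from illegible text, a common first step is splitting images by legibility. A minimal sketch follows, assuming the field names ('anns', 'legibility', 'image_id') of the official COCO-Text annotation JSON; the filename is hypothetical.

```python
import json

# Split COCO-Text images by the legibility of their text instances.
with open("cocotext.v2.json") as f:  # hypothetical path
    data = json.load(f)

legible_imgs, illegible_imgs = set(), set()
for ann in data["anns"].values():
    if ann["legibility"] == "legible":
        legible_imgs.add(ann["image_id"])
    else:
        illegible_imgs.add(ann["image_id"])

# Images whose only text is illegible:
only_illegible = illegible_imgs - legible_imgs
print(len(legible_imgs), "images with at least one legible instance")
```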
The ICDAR2003 dataset is a dataset for scene text recognition. It contains 507 natural scene images in total (258 training and 249 test images), annotated at the character level, so characters and words can be cropped from the images; a cropping sketch is given below.
51 PAPERS • 1 BENCHMARK
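Because the annotations are character- and word-level boxes, crops can be generated directly. A minimal sketch with Pillow follows; the words.xml element and attribute names are assumptions based on the commonly distributed ground-truth layout.

```python
import os
import xml.etree.ElementTree as ET
from PIL import Image

os.makedirs("crops", exist_ok=True)
root = ET.parse("words.xml").getroot()  # assumed ground-truth file
for image_el in root.iter("image"):
    img = Image.open(image_el.findtext("imageName"))
    for i, rect in enumerate(image_el.iter("taggedRectangle")):
        x, y = float(rect.get("x")), float(rect.get("y"))
        w, h = float(rect.get("width")), float(rect.get("height"))
        word = rect.findtext("tag")  # transcription of this word
        crop = img.crop((int(x), int(y), int(x + w), int(y + h)))
        crop.save(f"crops/{word}_{i}.png")
```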
The Street View Text (SVT) dataset was harvested from Google Street View. Image text in this data exhibits high variability and often has low resolution. Outdoor street-level imagery has two notable characteristics: (1) image text often comes from business signage, and (2) business names are easily available through geographic business searches. These factors make the SVT set uniquely suited for word spotting in the wild: given a street view image, the goal is to identify words from nearby businesses.
34 PAPERS • 1 BENCHMARK
TextOCR is a benchmark for text recognition on arbitrary-shaped scene text in natural images. It provides ~1M high-quality word annotations on TextVQA images, enabling end-to-end reasoning on downstream tasks such as visual question answering or image captioning; a sketch of reading the annotations is given below.
21 PAPERS • NO BENCHMARKS YET
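A minimal sketch of iterating the TextOCR word annotations, assuming the published JSON layout ('anns', 'utf8_string', 'points') and the convention that illegible words are transcribed as "."; the filename is hypothetical.

```python
import json

with open("TextOCR_0.1_train.json") as f:  # hypothetical filename
    data = json.load(f)

words = 0
for ann in data["anns"].values():
    text = ann["utf8_string"]   # "." marks an illegible word
    polygon = ann["points"]     # outline of the arbitrary-shaped word
    if text != ".":
        words += 1
print(words, "legible word annotations")
```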
A large-scale dataset with 12,263 annotated images. Two tasks are set up: text localization and end-to-end recognition. The competition ran from January 20 to May 31, 2017, and received 23 valid submissions from 19 teams.
19 PAPERS • NO BENCHMARKS YET
The CUTE80 dataset is a small dataset designed for evaluating curved text in natural scene images. It contains 80 high-resolution images of curved text, from which 288 cropped word images are commonly used for recognition evaluation.
16 PAPERS • NO BENCHMARKS YET
The Heavily Occluded Scene Text (HOST) dataset contains images of text with occlusions. It is used to evaluate and improve the recognition of occluded text in machine vision applications. The dataset is composed of 4,832 images that are manually occluded to a weak or heavy degree; a rough augmentation sketch that mimics such occlusion is given below.
13 PAPERS • 1 BENCHMARK
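HOST's occlusions are applied manually, but a rough training-time stand-in is random rectangular erasing over the word image. This is an illustrative augmentation in the spirit of Random Erasing, not the dataset's construction procedure.

```python
import random
import numpy as np

def occlude(img: np.ndarray, min_frac=0.1, max_frac=0.4) -> np.ndarray:
    """Blank out a random rectangle covering min_frac..max_frac of each side."""
    h, w = img.shape[:2]
    oh = int(h * random.uniform(min_frac, max_frac))
    ow = int(w * random.uniform(min_frac, max_frac))
    y = random.randint(0, h - oh)
    x = random.randint(0, w - ow)
    out = img.copy()
    out[y:y + oh, x:x + ow] = 0  # solid patch; noise fill is also common
    return out
```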
The IIIT5K dataset contains 5,000 text instance images: 2,000 for training and 3,000 for testing. It contains words from street scenes and from originally-digital images. Every image is associated with a 50-word lexicon and a 1,000-word lexicon, as used in the decoding sketch below.
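Lexicon-constrained evaluation replaces the raw prediction with the lexicon word at minimum edit distance. A minimal sketch, with case-insensitive comparison as in the standard protocol:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def constrain(prediction: str, lexicon: list[str]) -> str:
    """Snap a raw prediction to the closest lexicon word."""
    return min(lexicon, key=lambda w: edit_distance(prediction.lower(), w.lower()))

print(constrain("hcuse", ["house", "horse", "mouse"]))  # -> "house"
```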
The SVTP (SVT-Perspective) dataset contains 645 cropped word images collected from Google Street View at side-view angles, so the text frequently exhibits perspective distortion. Like the other common Latin/English benchmarks (IIIT5K, SVT, and CUTE80), its annotations are case-insensitive and contain no punctuation marks; a matching normalization sketch follows.
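A minimal normalization sketch consistent with that convention (exact filtering rules vary slightly between papers):

```python
import re

def normalize(s: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^0-9a-z]", "", s.lower())

def word_correct(pred: str, gt: str) -> bool:
    return normalize(pred) == normalize(gt)

assert word_correct("Cafe!", "cafe")
```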
The Weakly Occluded Scene Text (WOST) dataset is a public dataset containing scene text images in which the text is weakly (partially) occluded. Together with its heavily occluded counterpart HOST, it is used to evaluate the robustness of scene text recognition models to occlusion.
12 PAPERS • 1 BENCHMARK
This dataset includes 4,500 fully annotated images (over 30,000 license plate characters) from 150 vehicles in real-world scenarios where both the vehicle and the camera (inside another vehicle) are moving.
11 PAPERS • 1 BENCHMARK
IIIT-ILST is a dataset and benchmark for scene text recognition in three Indic scripts: Devanagari, Telugu, and Malayalam. It contains nearly 1,000 real images per script, annotated with scene text bounding boxes and transcriptions.
7 PAPERS • NO BENCHMARKS YET
MLe2e is a dataset for evaluating scene text end-to-end reading systems and all intermediate stages, such as text detection, script identification, and text recognition. The dataset contains a total of 711 scene images covering four different scripts (Latin, Chinese, Kannada, and Hangul).
6 PAPERS • NO BENCHMARKS YET
This dataset targets the License Plate Character Segmentation (LPCS) problem. The experimental results of the paper Benchmark for License Plate Character Segmentation were obtained using this dataset, which comprises 101 on-track vehicles captured during the day; the video was recorded with a static camera in early 2015.
6 PAPERS • 1 BENCHMARK
The Greek Sign Language (GSL) dataset is a large-scale RGB+D dataset suitable for Sign Language Recognition (SLR) and Sign Language Translation (SLT). The video captures were conducted using an Intel RealSense D435 RGB+D camera at a rate of 30 fps, with both the RGB and depth streams acquired at the same spatial resolution of 848×480 pixels. To increase variability in the videos, the camera position and orientation are slightly altered between subsequent recordings. Seven different signers perform 5 individual, commonly encountered scenarios in different public services; the average length of each scenario is twenty sentences.
5 PAPERS • NO BENCHMARKS YET
The RodoSol-ALPR dataset contains 20,000 images captured by static cameras located at pay tolls owned by the Rodovia do Sol (RodoSol) concessionaire, which operates 67.5 kilometers of a highway (ES-060) in the Brazilian state of Espírito Santo; a plate-format sketch is given below.
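Brazilian plates follow two layouts, the legacy LLLNNNN format and the Mercosur LLLNLNN format; that both appear in RodoSol-ALPR is an assumption stated for illustration. A minimal sketch of validating recognized plate strings:

```python
import re

LEGACY = re.compile(r"^[A-Z]{3}[0-9]{4}$")             # e.g. ABC1234
MERCOSUR = re.compile(r"^[A-Z]{3}[0-9][A-Z][0-9]{2}$")  # e.g. ABC1D23

def plate_layout(text: str) -> str | None:
    """Classify a recognized plate string, or return None if malformed."""
    text = text.replace("-", "").upper()
    if LEGACY.match(text):
        return "legacy"
    if MERCOSUR.match(text):
        return "mercosur"
    return None

print(plate_layout("ABC-1234"))  # -> legacy
print(plate_layout("ABC1D23"))   # -> mercosur
```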
The ChineseLP dataset contains 411 vehicle images (mostly of passenger cars) with Chinese license plates (LPs). It consists of 252 images captured by the authors and 159 images downloaded from the internet. The images present great variations in resolution (from 143 × 107 to 2048 × 1536 pixels), illumination and background.
4 PAPERS • 1 BENCHMARK
This dataset spans five domains (synthetic, document, street view, handwritten, and car license) and comprises over five million images.
2 PAPERS • 2 BENCHMARKS
The Mobile Turkish Scene Text (MTST 200) dataset consists of 200 indoor and outdoor Turkish scene text images.
2 PAPERS • NO BENCHMARKS YET
This dataset contains 2,000 images taken from inside a warehouse of the Energy Company of Paraná (Copel), which directly serves more than 4 million consuming units in the Brazilian state of Paraná.
2 PAPERS • 1 BENCHMARK
This dataset contains 12,500 meter images acquired in the field by the employees of the Energy Company of Paraná (Copel), which directly serves more than 4 million consuming units, across 395 cities and 1,113 locations (i.e., districts, villages and settlements), located in the Brazilian state of Paraná.
1 PAPER • 1 BENCHMARK
The IC13 dataset contains 561 images: 420 for training and 141 for testing. It inherits data from the IC03 dataset and extends it with new images. Like IC03, the IC13 dataset provides 1,015 cropped text instance images after removing words with non-alphanumeric characters; a sketch of this filter is given below.
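The non-alphanumeric filter is easy to reproduce. A minimal sketch; restricting to ASCII alphanumerics is an assumption about the exact rule:

```python
import re

ALNUM = re.compile(r"^[0-9A-Za-z]+$")

def keep(word: str) -> bool:
    """Keep only words made entirely of ASCII letters and digits."""
    return bool(ALNUM.match(word))

words = ["Kodak", "3rd", "don't", "£5"]
print([w for w in words if keep(w)])  # -> ['Kodak', '3rd']
```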
The UTRSet-Real dataset is a comprehensive, manually annotated dataset specifically curated for printed Urdu OCR research. It contains over 11,000 printed text line images, each meticulously annotated. A standout feature of this dataset is its remarkable diversity, which includes variations in fonts, text sizes, colours, orientations, lighting conditions, noise, styles, and backgrounds. This diversity closely mirrors real-world scenarios, making the dataset highly suitable for training and evaluating models that aim to excel in real-world Urdu text recognition tasks.
The UTRSet-Synth dataset is introduced as a complementary training resource to the UTRSet-Real Dataset, specifically designed to enhance the effectiveness of Urdu OCR models. It is a high-quality synthetic dataset comprising 20,000 lines that closely resemble real-world representations of Urdu text.
1 PAPER • NO BENCHMARKS YET
Vehicle-Rear is a novel dataset for vehicle identification that contains more than three hours of high-resolution videos, with accurate information about the make, model, color and year of nearly 3,000 vehicles, in addition to the position and identification of their license plates.
This repository contains datasets and baselines for benchmarking Chinese text recognition. Please see the corresponding paper for more details regarding the datasets, baselines, the empirical study, etc.
0 PAPERS • NO BENCHMARKS YET