10 dataset results for Handwritten Text Recognition

The IAM database contains 13,353 images of handwritten lines of text created by 657 writers. The texts those writers transcribed are from the Lancaster-Oslo/Bergen Corpus of British English. It includes contributions from 657 writers making a total of 1,539 handwritten pages comprising of 115,320 words and is categorized as part of modern collection. The database is labeled at the sentence, line, and word levels.

168 PAPERS • 2 BENCHMARKS

Bentham (Bentham project)

Bentham manuscripts refers to a large set of documents that were written by the renowned English philosopher and reformer Jeremy Bentham (1748-1832). Volunteers of the Transcribe Bentham initiative transcribed this collection. Currently, >6 000 documents or > 25 000 pages have been transcribed using this public web platform. For our experiments, we used the BenthamR0 dataset a part of the Bentham manuscripts.

4 PAPERS • 1 BENCHMARK

HKR (Handwritten Kazakh and Russian (HKR) Database for Text Recognition)

The database is written in Cyrillic and shares the same 33 characters. Besides these characters, the Kazakh alphabet also contains 9 additional specific characters. This dataset is a collection of forms. The sources of all the forms in the datasets were generated by LATEX which subsequently was filled out by persons with their handwriting. The database consists of more than 1400 filled forms. There are approximately 63000 sentences, more than 715699 symbols produced by approximately 200 diferent writers. We utilized three different datasets described as following:

4 PAPERS • 1 BENCHMARK

READ 2016 (HTR Dataset ICFHR 2016)

This dataset arises from the READ project (Horizon 2020).

4 PAPERS • 1 BENCHMARK

Digital Peter

Digital Peter is a dataset of Peter the Great's manuscripts annotated for segmentation and text recognition. The dataset may be useful for researchers to train handwriting text recognition models as a benchmark for comparing different models. It consists of 9,694 images and text files corresponding to lines in historical documents. The dataset includes Peter’s handwritten materials covering the period from 1709 to 1713.

3 PAPERS • 1 BENCHMARK

BN-HTRd (BN-HTRd: A Benchmark Dataset for Document Level Offline Bangla Handwritten Text Recognition (HTR))

We introduce a new Dataset (BN-HTRd) for offline Handwritten Text Recognition (HTR) from images of Bangla scripts comprising words, lines, and document-level annotations. The BN-HTRd dataset is based on the BBC Bangla News corpus - which acted as ground truth texts for the handwritings. Our dataset contains a total of 786 full-page images collected from 150 different writers. With a staggering 1,08,181 instances of handwritten words, distributed over 14,383 lines and 23,115 unique words, this is currently the 'largest and most comprehensive dataset' in this field. We also provided the bounding box annotations (YOLO format) for the segmentation of words/lines and the ground truth annotations for full-text, along with the segmented images and their positions. The contents of our dataset came from a diverse news category, and annotators of different ages, genders, and backgrounds, having variability in writing styles. The BN-HTRd dataset can be adopted as a basis for various handwriting c

2 PAPERS • 2 BENCHMARKS

Saint Gall

Saint Gall dataset contains handwritten historical manuscripts written in Latin that date back to the 9th century. It consists of 60 pages, 1 410 text lines and 11 597 words.

2 PAPERS • 1 BENCHMARK

Belfort (The Belfort dataset: Handwritten Text Recognition from Crowdsourced Annotations)

The Belfort dataset This dataset includes minutes of Belfort municipal council drawn up between 1790 and 1946. Documents include deliberations, lists of councillors, convocations, and agendas. It includes 24,105 text-line images that were automatically detected from pages. Up to 4 transcriptions are available for each line image: two from humans, and two from automatic models.

1 PAPER • 1 BENCHMARK

MatriVasha:

MatriVasha: (MatriVasha: Compound Character atasetD)

MatriVasha the largest dataset of handwritten Bangla compound characters for research on handwritten Bangla compound character recognition. The proposed dataset contains 120 different types of compound characters that consist of 306,464‬ images written where 152,950 male and 153,514 female handwritten Bangla compound characters. This dataset can be used for other issues such as gender, age, district base handwriting research because the sample was collected that included district authenticity, age group, and an equal number of men and women.

1 PAPER • NO BENCHMARKS YET

SIMARA (SIMARA: a database for key-value information extraction from full-page handwritten documents)

Description We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids from six different series, dating from the 18th-20th centuries. Finding aids are handwritten documents that contain metadata describing older archives. They are stored in the National Archives of France and are used by archivists to identify and find archival documents.

1 PAPER • 2 BENCHMARKS

Datasets

10 dataset results for Handwritten Text Recognition