The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of the larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students), which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were then centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.
6,980 PAPERS • 52 BENCHMARKS
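The centering step described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the original NIST/MNIST code; the function name and the choice of `np.roll` for translation are assumptions.

```python
import numpy as np

def center_by_mass(digit_20x20: np.ndarray) -> np.ndarray:
    """Place a 20x20 grayscale digit on a 28x28 canvas so that its pixel
    center of mass lands at the center of the field, as in the MNIST
    preprocessing description (illustrative sketch, not the NIST code)."""
    canvas = np.zeros((28, 28), dtype=float)
    canvas[4:24, 4:24] = digit_20x20  # paste into a centered 20x20 box
    total = canvas.sum()
    if total == 0:
        return canvas
    ys, xs = np.indices(canvas.shape)
    cy = (ys * canvas).sum() / total  # center of mass, row coordinate
    cx = (xs * canvas).sum() / total  # center of mass, column coordinate
    # Translate so the center of mass sits at the field center (13.5, 13.5).
    dy = int(round(13.5 - cy))
    dx = int(round(13.5 - cx))
    return np.roll(np.roll(canvas, dy, axis=0), dx, axis=1)
```

Because the translation is rounded to whole pixels, the center of mass ends up within half a pixel of the field center.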
CoNLL-2003 is a named entity recognition dataset released as part of the CoNLL-2003 shared task: language-independent named entity recognition. The data consists of eight files covering two languages: English and German. For each language there is a training file, a development file, a test file, and a large file with unannotated data.
637 PAPERS • 16 BENCHMARKS
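The annotated files use a simple column format: one token per line with its part-of-speech, chunk, and NER tags, and blank lines separating sentences. A minimal parser might look like the sketch below; the function name is an assumption, and real files also contain `-DOCSTART-` document markers, which are skipped here.

```python
def read_conll(lines):
    """Parse CoNLL-2003-style lines ('token POS chunk NER' per line,
    blank lines between sentences) into a list of sentences, each a
    list of (token, ner_tag) pairs. Illustrative sketch only."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:  # blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):  # document boundary marker
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))  # keep token and NER tag
    if current:
        sentences.append(current)
    return sentences
```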
The CoNLL dataset is a widely used resource in the field of natural language processing (NLP). The term “CoNLL” stands for Conference on Natural Language Learning. It originates from a series of shared tasks organized at the Conferences of Natural Language Learning.
176 PAPERS • 49 BENCHMARKS
The NCBI Disease corpus consists of 793 PubMed abstracts, which are separated into training (593), development (100) and test (100) subsets. The NCBI Disease corpus is annotated with disease mentions, using concept identifiers from either MeSH or OMIM.
140 PAPERS • 3 BENCHMARKS
The SciERC dataset is a collection of 500 scientific abstracts annotated with scientific entities, their relations, and coreference clusters. The abstracts are taken from 12 AI conference/workshop proceedings in four AI communities, from the Semantic Scholar Corpus. SciERC extends previous scientific-article datasets (SemEval 2017 Task 10 and SemEval 2018 Task 7) by broadening the entity types, relation types, and relation coverage, and by adding cross-sentence relations using coreference links.
117 PAPERS • 7 BENCHMARKS
WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles that have been annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format. This dataset serves as a valuable resource for training and evaluating named entity recognition models across various languages.
57 PAPERS • 3 BENCHMARKS
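In the IOB2 scheme mentioned above, every entity begins with a `B-` tag, continuation tokens take `I-`, and `O` marks tokens outside any entity. The snippet below illustrates the scheme and how spans are recovered from it; the example sentence and function name are assumptions for illustration.

```python
# A minimal illustration of IOB2 tagging with the WikiANN tag set.
tokens = ["Barack", "Obama", "visited", "New", "York"]
tags   = ["B-PER",  "I-PER", "O",       "B-LOC", "I-LOC"]

def extract_entities(tokens, tags):
    """Recover (entity_text, entity_type) spans from IOB2 tags."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # start of a new entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # entity continuation
            current.append(tok)
        else:                             # O tag: close any open entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities
```

Running `extract_entities(tokens, tags)` on the example yields the two spans "Barack Obama" (PER) and "New York" (LOC).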
HAREM, an initiative by Linguateca, provides a Golden Collection: a curated repository of annotated Portuguese texts. This resource serves as a benchmark for evaluating named entity recognition systems on Portuguese documents, and it remains a foundational reference for research in Portuguese language processing.
1 PAPER • NO BENCHMARKS YET
The MiniHAREM, a reiteration of the 2005 evaluation, used the same methodology and platform. Held from April 3rd to 5th, 2006, it offered participants a 48-hour window to annotate, verify, and submit text collections. Results are available, and the collection used is accessible. Participant lists, submitted outputs, and updated guidelines are provided. Additionally, the HAREM format checker ensures compliance with MiniHAREM directives. Information for the HAREM Meeting, open for registration until June 15th after the Linguateca Summer School in the University of Porto, is also available.
The dataset used to pre-train NuNER, from the paper "NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data".
This dataset was taken from the SIGARRA information system at the University of Porto (UP). Every organic unit has its own domain and produces academic news. We collected a sample of 1,000 news articles, manually annotating 905 of them using the Brat rapid annotation tool. The dataset consists of three files. The first is a CSV file containing news published between 2016-12-14 and 2017-03-01. The second is a ZIP archive containing one directory per organic unit, with a text file and an annotations file per news article. The third is an XML file containing the complete set of news in a format similar to that of the HAREM dataset. This dataset is particularly well suited for training named entity recognition models.
0 PAPERS • NO BENCHMARKS YET
The Second HAREM was an evaluation exercise in Portuguese Named Entity Recognition. It aims to refine text annotation processes, building on the First HAREM. Challenges include adapting guidelines for new texts and establishing a unified document with directives from both editions.