The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. The corpus was created to support the development and evaluation of information extraction and text mining systems for the domain of molecular biology.
116 PAPERS • 6 BENCHMARKS
ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.
62 PAPERS • 9 BENCHMARKS
SCIREX is a document level IE dataset that encompasses multiple IE tasks, including salient entity identification and document level N-ary relation identification from scientific articles. The dataset is annotated by integrating automatic and human annotations, leveraging existing scientific knowledge resources.
32 PAPERS • 2 BENCHMARKS
WikiEvents is a document-level event extraction benchmark dataset which includes complete event and coreference annotation.
26 PAPERS • 2 BENCHMARKS
Ten years (2008-2018) ChFinAnn documents and human-summarized event knowledge bases to conduct the DS-based event labeling. Five event types included: Equity Freeze (EF), Equity Repurchase (ER), Equity Underweight (EU), Equity Overweight (EO) and Equity Pledge (EP), which belong to major events required to be disclosed by the regulator and may have a huge impact on the company value. To ensure the labeling quality, the authors set constraints for matched document-record pairs.
19 PAPERS • 1 BENCHMARK
Aims to extract events and their arguments from multimedia documents. Develops the first benchmark and collect a dataset of 245 multimedia news articles with extensively annotated events and arguments.
14 PAPERS • NO BENCHMARKS YET
French TimeBank, a corpus for French annotated in ISO-TimeML.
6 PAPERS • 1 BENCHMARK
Phee is a dataset for pharmacovigilance comprising over 5000 annotated events from medical case reports and biomedical literature. It is designed for biomedical event extraction tasks.
3 PAPERS • NO BENCHMARKS YET
Title2Event is a large-scale sentence-level dataset for benchmarking Open Event Extraction without restricting event types. Title2Event contains more than 42,000 news titles in 34 topics collected from Chinese web pages.
Catalan TimeBank 1.0 was developed by researchers at Barcelona Media and consists of Catalan texts in the AnCora corpus annotated with temporal and event information according to the TimeML specification language.
2 PAPERS • 1 BENCHMARK
The EDT dataset is designed for corporate event detection and text-based stock prediction (trading strategy) benchmark.
2 PAPERS • NO BENCHMARKS YET
EventNarrative is a knowledge graph-to-text dataset from publicly available open-world knowledge graphs. EventNarrative consists of approximately 230,000 graphs and their corresponding natural language text.
IndiaPoliceEvents is a corpus of 21,391 sentences from 1,257 English-language Times of India articles about events in the state of Gujarat during March 2002. This dataset is used for automated event extraction.
The PEDC is a corpus of 14 episodes of This American Life podcast transcripts that have been annotated for events. The corpus contains excerpts from these episodes (listed in Tabe 1) that are dialogue. The granularity of annotation in this corpus is the token; each token is either annotated as an event, or a nonevent. For more information please download the corpus, and see the annotation guide for more specifics on how we define event, and the README for how the annotations are encoded. Also, much more information regarding the corpus, and its use is in the Automatic extraction of personal events from dialogue paper.
1 PAPER • NO BENCHMARKS YET