TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. Examples in TACRED cover 41 relation types as used in the TAC KBP challenges (e.g., per:schools_attended and org:members) or are labeled as no_relation if no defined relation is held. These examples are created by combining available human annotations from the TAC KBP challenges and crowdsourcing.
185 PAPERS • 2 BENCHMARKS
The FewRel (Few-Shot Relation Classification Dataset) contains 100 relations and 70,000 instances from Wikipedia. The dataset is divided into three subsets: training set (64 relations), validation set (16 relations) and test set (20 relations).
170 PAPERS • 4 BENCHMARKS
A more challenging task to investigate two aspects of few-shot relation classification models: (1) Can they adapt to a new domain with only a handful of instances? (2) Can they detect none-of-the-above (NOTA) relations?
38 PAPERS • NO BENCHMARKS YET
This is the Multi-Axis Temporal RElations for Start-points (i.e., MATRES) dataset
13 PAPERS • 2 BENCHMARKS
The Cornell eRulemaking Corpus – CDCP is an argument mining corpus annotated with argumentative structure information capturing the evaluability of arguments. The corpus consists of 731 user comments on Consumer Debt Collection Practices (CDCP) rule by the Consumer Financial Protection Bureau (CFPB); the resulting dataset contains 4931 elementary unit and 1221 support relation annotations. It is a resource for building argument mining systems that can not only extract arguments from unstructured text, but also identify what additional information is necessary for readers to understand and evaluate a given argument. Immediate applications include providing real-time feedback to commenters, specifying which types of support for which propositions can be added to construct better-formed arguments.
11 PAPERS • 3 BENCHMARKS
The Discovery datasets consists of adjacent sentence pairs (s1,s2) with a discourse marker (y) that occurred at the beginning of s2. They were extracted from the depcc web corpus.
9 PAPERS • 1 BENCHMARK
GUM is an open source multilayer English corpus of richly annotated texts from twelve text types. Annotations include:
8 PAPERS • 1 BENCHMARK
RELX is a benchmark dataset for cross-lingual relation classification in English, French, German, Spanish and Turkish.
6 PAPERS • NO BENCHMARKS YET
LabPics Chemistry Dataset
5 PAPERS • NO BENCHMARKS YET
The TACRED-Revisited dataset improves the crowd-sourced TACRED dataset for relation extraction by relabeling the dev and test sets using expert linguistic annotators. Relabeling focuses on the 5K most challenging instances in dev and test, in total, 51.2% of these are corrected. Published at ACL 2020.
5 PAPERS • 1 BENCHMARK
CrossRE is a cross-domain benchmark for Relation Extraction (RE), which comprises six distinct text domains and includes multi-label annotations. The dataset includes meta-data collected during annotation, to include explanations and flags of difficult instances.
4 PAPERS • NO BENCHMARKS YET
The DISRPT 2021 shared task, co-located with CODI 2021 at EMNLP, introduces the second iteration of a cross-formalism shared task on discourse unit segmentation and connective detection, as well as the first iteration of a cross-formalism discourse relation classification task.
3 PAPERS • NO BENCHMARKS YET
The Dr. Inventor Multi-Layer Scientific Corpus (DRI Corpus) includes 40 Computer Graphics papers, selected by domain experts. Each paper of the Corpus has been annotated by three annotators by providing the following layers of annotations, each one characterizing a core aspect of scientific publications:
2 PAPERS • 2 BENCHMARKS
FREDo is a Few-Shot Document-Level Relation Extraction Benchmark based on DocRED and SciERC. The dataset is divided into four subsets: training set (62 relations), validation set (16 relations), in-domain test set (16 relations), and cross-domain test set (7 relations).
PcMSP is a dataset annotated from 305 open access scientific articles for material science information extraction that simultaneously contains the synthesis sentences extracted from the experimental paragraphs, as well as the entity mentions and intra-sentence relations.
2 PAPERS • NO BENCHMARKS YET
The AbstRCT dataset consists of randomized controlled trials retrieved from the MEDLINE database via PubMed search. The trials are annotated with argument components and argumentative relations.
1 PAPER • 2 BENCHMARKS
Click to add a brief description of the dataset (Markdown and LaTeX enabled).
1 PAPER • NO BENCHMARKS YET
MultiTACRED is a multilingual version of the large-scale TAC Relation Extraction Dataset. It covers 12 typologically diverse languages from 9 language families, and was created by the Speech & Language Technology group of DFKI by machine-translating the instances of the original TACRED dataset and automatically projecting their entity annotations. For details of the original TACRED's data collection and annotation process, see the Stanford paper. Translations are syntactically validated by checking the correctness of the XML tag markup. Any translations with an invalid tag structure, e.g. missing or invalid head or tail tag pairs, are discarded (on average, 2.3% of the instances).
The relational pattern similarity dataset is a new dataset upon the work of Zeichner et al. (2012), which consists of relational patterns with semantic inference labels annotated. The dataset includes 5,555 pairs extracted by Reverb (Fader et al., 2011), 2,447 pairs with inference relation and 3,108 pairs (the rest) without one.
Graph Neural Networks (GNNs) have gained traction across different domains such as transportation, bio-informatics, language processing, and computer vision. However, there is a noticeable absence of research on applying GNNs to supply chain networks. Supply chain networks are inherently graphlike in structure, making them prime candidates for applying GNN methodologies. This opens up a world of possibilities for optimizing, predicting, and solving even the most complex supply chain problems. A major setback in this approach lies in the absence of real-world benchmark datasets to facilitate the research and resolution of supply chain problem using GNNs. To address the issue, we present a real-world benchmark dataset for temporal tasks, obtained from one of the leading FMCG companies in Bangladesh, focusing on supply chain planning for production purposes. The dataset includes temporal data as node features to enable sales predictions, production planning, and the identification of fact
Green family of datasets for emergent communications on relations.
533 parallel examples sampled from TACRED, translated into Russian and Korean (and 3 additional examples in Russian), accompanied with tranlsation of a list of trigger words collected for the different relations.
Dataset Introduction TFH_Annotated_Dataset is an annotated patent dataset pertaining to thin film head technology in hard-disk. To the best of our knowledge, this is the second labeled patent dataset public available in technology management domain that annotates both entities and the semantic relations between entities, the first one is [1].
0 PAPER • NO BENCHMARKS YET