Molweni is a machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. It is sampled from the Ubuntu Chat Corpus and includes 10,000 dialogs comprising 88,303 utterances.
25 PAPERS • 2 BENCHMARKS
The Rhetorical Structure Theory (RST) Discourse Treebank consists of 385 Wall Street Journal articles from the Penn Treebank annotated with discourse structure in the RST framework along with human-generated extracts and abstracts associated with the source documents.
20 PAPERS • 1 BENCHMARK
GUM is an open-source multilayer English corpus of richly annotated texts from twelve text types. Annotations include:
8 PAPERS • 1 BENCHMARK
AMALGUM is a machine-annotated multilayer corpus following the same design and annotation layers as GUM, but substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high-quality, richly annotated, but small datasets and the larger but shallowly annotated corpora that are often scraped from the Web.
5 PAPERS • NO BENCHMARKS YET
This discourse treebank includes annotated instructional texts originally assembled at the Information Technology Research Institute, University of Brighton. This dataset contains 176 documents with an average of 32.6 EDUs, for a total of 5,744 EDUs and 53,250 words.
4 PAPERS • 1 BENCHMARK
The DISRPT 2021 shared task, co-located with CODI 2021 at EMNLP, introduces the second iteration of a cross-formalism shared task on discourse unit segmentation and connective detection, as well as the first iteration of a cross-formalism discourse relation classification task.
3 PAPERS • NO BENCHMARKS YET
The SPOT dataset contains 197 reviews originating from the Yelp'13 and IMDB collections (1), annotated with segment-level polarity labels (positive/neutral/negative). Annotations have been gathered on 2 levels of granularity:
The Potsdam Commentary Corpus (PCC) is a corpus of 220 German newspaper commentaries (2,900 sentences, 44,000 tokens) taken from the online issues of the Märkische Allgemeine Zeitung (MAZ subcorpus) and Tagesspiegel (ProCon subcorpus). It is annotated with a range of different types of linguistic information.
2 PAPERS • NO BENCHMARKS YET