8 dataset results for Discourse Parsing

A machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni's source samples from the Ubuntu Chat Corpus, including 10,000 dialogs comprising 88,303 utterances.

25 PAPERS • 2 BENCHMARKS

RST-DT

RST-DT (RST Discourse Treebank)

The Rhetorical Structure Theory (RST) Discourse Treebank consists of 385 Wall Street Journal articles from the Penn Treebank annotated with discourse structure in the RST framework along with human-generated extracts and abstracts associated with the source documents.

20 PAPERS • 1 BENCHMARK

GUM (Georgetown University Multilayer corpus)

GUM is an open source multilayer English corpus of richly annotated texts from twelve text types. Annotations include:

8 PAPERS • 1 BENCHMARK

AMALGUM

AMALGUM (A Machine Annotated Lookalike of GUM)

AMALGUM is a machine annotated multilayer corpus following the same design and annotation layers as GUM, but substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high quality, richly annotated, but small datasets, and the larger but shallowly annotated corpora that are often scraped from the Web.

5 PAPERS • NO BENCHMARKS YET

Instructional-DT (Instr-DT)

Instructional-DT (Instr-DT) (Instructional Discourse Treebank)

This discourse treebank includes annotated instructional texts originally assembled at the Information Technology Research Institute, University of Brighton. This dataset contains 176 documents with an average of 32.6 EDUs for a total of 5744 EDUs and 53,250 words.

4 PAPERS • 1 BENCHMARK

DISRPT2021

DISRPT2021 (DISRPT2021 shared task on Discourse Unit Segmentation, Connective Detection and Discourse Relation Classification)

The DISRPT 2021 shared task, co-located with CODI 2021 at EMNLP, introduces the second iteration of a cross-formalism shared task on discourse unit segmentation and connective detection, as well as the first iteration of a cross-formalism discourse relation classification task.

3 PAPERS • NO BENCHMARKS YET

SPOT (Sentiment Polarity Annotations Dataset)

The SPOT dataset contains 197 reviews originating from the Yelp'13 and IMDB collections (1), annotated with segment-level polarity labels (positive/neutral/negative). Annotations have been gathered on 2 levels of granulatiry:

3 PAPERS • NO BENCHMARKS YET

PCC

PCC (Potsdam Commentary Corpus)

The Potsdam Commentary Corpus (PCC) is a corpus of 220 German newspaper commentaries (2.900 sentences, 44.000 tokens) taken from the online issues of the Märkische Allgemeine Zeitung (MAZ subcorpus) and Tagesspiegel (ProCon subcorpus) and is annotated with a range of different types of linguistic information.

2 PAPERS • NO BENCHMARKS YET

Datasets

8 dataset results for Discourse Parsing