Generation, Evaluation, and Metrics (GEM) is a benchmark environment for Natural Language Generation with a focus on its Evaluation, both through human annotations and automated Metrics.
42 PAPERS • 1 BENCHMARK
The Extreme Summarization (XSum) dataset is a dataset for evaluation of abstractive single-document summarization systems. The goal is to create a short, one-sentence news summary answering the question “What is the article about?”. The dataset consists of 226,711 news articles accompanied by a one-sentence summary. The articles are collected from BBC articles (2010 to 2017) and cover a wide variety of domains (e.g., News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts). The official random split contains 204,045 (90%), 11,332 (5%) and 11,334 (5%) documents in the training, validation and test sets, respectively.
27 PAPERS • 6 BENCHMARKS
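As a minimal sketch, XSum can be loaded with the Hugging Face `datasets` library to verify the split sizes listed above; the `"EdinburghNLP/xsum"` identifier and the `document`/`summary` field names are assumptions about the published version and may need adjusting for a local copy.

```python
# Minimal sketch: load XSum and inspect the official splits.
# Assumes the Hugging Face `datasets` library and the "EdinburghNLP/xsum"
# dataset identifier; adjust if your source differs.
from datasets import load_dataset

xsum = load_dataset("EdinburghNLP/xsum")

# Expected sizes per the official random split: 204,045 / 11,332 / 11,334.
for split in ("train", "validation", "test"):
    print(split, len(xsum[split]))

# Each example pairs a BBC article with its one-sentence summary.
sample = xsum["train"][0]
print(sample["document"][:200])
print(sample["summary"])
```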
SciTLDR is a multi-target dataset of 5.4K TLDRs over 3.2K scientific papers. It contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden.
11 PAPERS • NO BENCHMARKS YET
CiteSum is a large-scale scientific extreme summarization benchmark.
4 PAPERS • 1 BENCHMARK
An open corpus of scientific research papers with a representative sample from across scientific disciplines. In addition to the full text of each article, the corpus includes document metadata and bibliographic information for each reference.
2 PAPERS • NO BENCHMARKS YET
TLDR9+ is a large-scale summarization dataset containing over 9 million training instances extracted from the Reddit discussion forum. The dataset is specifically gathered for extreme summarization (i.e., generating a one-sentence summary with high compression and abstraction) and is more than twice as large as the previously proposed dataset. With the help of human annotations, a more fine-grained dataset, TLDRHQ, is distilled by sampling high-quality instances from TLDR9+.
2 PAPERS • 1 BENCHMARK
SummZoo is a benchmark consisting of 8 diverse summarization tasks with multiple sets of few-shot samples for each task, covering both monologue and dialogue domains.
1 PAPER • NO BENCHMARKS YET