The Reddit dataset is a graph dataset built from Reddit posts made in September 2014. The node label is the community, or “subreddit”, that a post belongs to. 50 large communities were sampled to build a post-to-post graph, connecting two posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.
587 PAPERS • 13 BENCHMARKS
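For reference, a minimal loading sketch for the Reddit graph described above, assuming the copy of the dataset bundled with PyTorch Geometric (the root path is arbitrary):

```python
# Minimal sketch: load the Reddit post-to-post graph via PyTorch Geometric.
# Assumes the torch_geometric package; the raw data can also be obtained
# directly from the GraphSAGE project release.
from torch_geometric.datasets import Reddit

dataset = Reddit(root="data/Reddit")   # downloads and caches on first use
data = dataset[0]                      # the dataset is a single large graph

print(data.num_nodes, data.num_edges)  # 232,965 posts and their comment-based edges
print(dataset.num_classes)             # number of subreddit labels
print(data.train_mask.sum().item(),    # temporal train / val / test split masks
      data.val_mask.sum().item(),
      data.test_mask.sum().item())
```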
CNN/Daily Mail is a dataset for text summarization. Human-generated abstractive summary bullets were created from news stories on the CNN and Daily Mail websites and turned into questions (with one of the entities hidden), with the stories serving as the corresponding passages from which the system is expected to answer the fill-in-the-blank question. The authors released the scripts that crawl, extract, and generate pairs of passages and questions from these websites.
463 PAPERS • 10 BENCHMARKS
The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:
264 PAPERS • 7 BENCHMARKS
A new dataset with abstractive dialogue summaries.
106 PAPERS • 6 BENCHMARKS
WikiHow is a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base written by different human authors. The articles span a wide range of topics and represent a high diversity of styles.
106 PAPERS • 2 BENCHMARKS
CORNELL NEWSROOM is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017 and use a variety of summarization strategies combining extraction and abstraction.
101 PAPERS • NO BENCHMARKS YET
LCSTS is a large corpus for Chinese short-text summarization, constructed from the Chinese microblogging website Sina Weibo and released to the public. The corpus consists of over 2 million real Chinese short texts with short summaries written by the author of each text. The authors also manually tagged the relevance of 10,666 short summaries to their corresponding short texts.
57 PAPERS • 2 BENCHMARKS
WikiSum is a dataset based on English Wikipedia and suitable for the task of multi-document abstractive summarization. In each instance, the input comprises a Wikipedia topic (the title of an article) and a collection of non-Wikipedia reference documents, and the target is the Wikipedia article text. The dataset is restricted to articles with at least one crawlable citation. The official split divides the articles roughly 80/10/10 into train/development/test subsets, resulting in 1,865,750, 233,252, and 232,998 examples, respectively.
51 PAPERS • NO BENCHMARKS YET
WikiLingua includes ~770k article and summary pairs in 18 languages from WikiHow. Gold-standard article-summary alignments across languages are extracted by aligning the images that are used to describe each how-to step in an article.
50 PAPERS • 5 BENCHMARKS
The Reddit TIFU dataset is a newly collected Reddit dataset, where TIFU denotes the name of the /r/tifu subreddit. There are 122,933 text-summary pairs in total.
44 PAPERS • 1 BENCHMARK
XL-Sum is a comprehensive and diverse dataset for abstractive summarization comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
41 PAPERS • NO BENCHMARKS YET
A large-scale MultiLingual SUMmarization (MLSUM) dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community.
40 PAPERS • 5 BENCHMARKS
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.
38 PAPERS • 2 BENCHMARKS
BillSum is the first dataset for summarization of US Congressional and California state bills.
36 PAPERS • 2 BENCHMARKS
BookSum is a collection of datasets for long-form narrative summarization. This dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human-written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of this dataset pose a unique set of challenges for summarization systems, including processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures.
29 PAPERS • 1 BENCHMARK
The Extreme Summarization (XSum) dataset is a dataset for the evaluation of abstractive single-document summarization systems. The goal is to create a short, one-sentence news summary answering the question “What is the article about?”. The dataset consists of 226,711 news articles accompanied by a one-sentence summary. The articles are collected from BBC articles (2010 to 2017) and cover a wide variety of domains (e.g., News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts). The official random split contains 204,045 (90%), 11,332 (5%) and 11,334 (5%) documents in the training, validation and test sets, respectively.
27 PAPERS • 6 BENCHMARKS
The AMR Bank is a set of English sentences paired with simple, readable semantic representations. Version 3.0 released in 2020 consists of 59,255 sentences.
22 PAPERS • 1 BENCHMARK
A dataset for studying the task of email subject line generation: automatically generating an email subject line from the email body.
18 PAPERS • 1 BENCHMARK
GLGE is a general language generation evaluation benchmark which is composed of 8 language generation tasks, including Abstractive Text Summarization (CNN/DailyMail, Gigaword, XSUM, MSNews), Answer-aware Question Generation (SQuAD 1.1, MSQG), Conversational Question Answering (CoQA), and Personalizing Dialogue (Personachat).
12 PAPERS • NO BENCHMARKS YET
A new multi-target dataset of 5.4K TLDRs over 3.2K papers. SciTLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden.
11 PAPERS • NO BENCHMARKS YET
5,519 query-based summaries, each associated with an average of 6 input documents selected from an index of 355M documents from Common Crawl.
10 PAPERS • NO BENCHMARKS YET
This is a dataset for evaluating summarisation methods for research papers.
10 PAPERS • 3 BENCHMARKS
The dataset is collected from 159 Critical Role episodes transcribed to text dialogues, consisting of 398,682 turns. It also includes corresponding abstractive summaries collected from the Fandom wiki. The dataset is linguistically unique in that the narratives are generated entirely through player collaboration and spoken interaction.
9 PAPERS • NO BENCHMARKS YET
Presents two high-quality, large-scale cross-lingual summarization (CLS) datasets based on existing monolingual summarization datasets.
Global Voices is a multilingual dataset for evaluating cross-lingual summarization methods. It is extracted from social-network descriptions of Global Voices news articles to cheaply collect evaluation data for into-English and from-English summarization in 15 languages.
8 PAPERS • NO BENCHMARKS YET
A large-scale Indonesian summarization dataset consisting of harvested articles from Liputan6.com, an online news portal, resulting in 215,827 document-summary pairs.
5 PAPERS • NO BENCHMARKS YET
ConvoSumm is a suite of four datasets to evaluate a model’s performance on a broad spectrum of conversation data.
4 PAPERS • NO BENCHMARKS YET
This dataset is an extension of MASAC, a multimodal, multi-party, Hindi-English code-mixed dialogue dataset compiled from the popular Indian TV show, ‘Sarabhai v/s Sarabhai’. WITS was created by augmenting MASAC with natural language explanations for each sarcastic dialogue. The dataset consists of the transcribed sarcastic dialogues from 55 episodes of the TV show, along with audio and video multimodal signals. It was designed to facilitate Sarcasm Explanation in Dialogue (SED), a novel task aimed at generating a natural language explanation for a given sarcastic dialogue that spells out the intended irony. Each data instance in WITS is associated with a corresponding video, audio, and textual transcript in which the last utterance is sarcastic in nature. All the final selected explanations contain the following attributes:
4 PAPERS • 2 BENCHMARKS
The DeepMind Q&A Dataset consists of two datasets for question answering, CNN and DailyMail. Each dataset contains many documents (90k and 197k documents, respectively), and each document is accompanied by approximately 4 questions on average. Each question is a sentence with one missing word/phrase which can be found in the accompanying document/context.
3 PAPERS • NO BENCHMARKS YET
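To make the cloze format of the DeepMind Q&A Dataset concrete, the made-up example below shows the kind of triple it provides: a summary bullet with one entity hidden is the question, and the accompanying story is the context. The field names, entity markers, and text here are illustrative only, not the exact release format.

```python
# Made-up illustration of a cloze-style instance: the answer is the entity
# hidden behind the @placeholder token in the question.
example = {
    "context": "@entity0 beat @entity1 2-0 in Wednesday's semi-final at @entity2 ...",
    "question": "@placeholder reached the final after beating @entity1",
    "answer": "@entity0",
}
```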
Shmoop Corpus is a dataset of 231 stories that are paired with detailed multi-paragraph summaries for each individual chapter (7,234 chapters), where the summary is chronologically aligned with respect to the story chapter. From the corpus, a set of common NLP tasks are constructed, including Cloze-form question answering and a simplified form of abstractive summarization, as benchmarks for reading comprehension on stories.
A set of approximately 100K podcast episodes comprised of raw audio files along with accompanying ASR transcripts. This represents over 47,000 hours of transcribed audio, and is an order of magnitude larger than previous speech-to-text corpora.
Fanpage dataset, containing news articles taken from Fanpage.
2 PAPERS • 1 BENCHMARK
IlPost dataset, containing news articles taken from IlPost.
FINDSum is a large-scale dataset for long text and multi-table summarization. It is built on 21,125 annual reports from 3,794 companies and has two subsets for summarizing each company’s results of operations and liquidity.
2 PAPERS • NO BENCHMARKS YET
PeerSum is a new multi-document summarization (MDS) dataset built from peer reviews of scientific publications. The dataset differs from existing MDS datasets in that the summaries (i.e., the meta-reviews) are highly abstractive and are real summaries of the source documents.
A dataset containing the documents, source and fusion sentences, and human annotations of points of correspondence between sentences. The dataset bridges the gap between coreference resolution and summarization.
A single-document Vietnamese summarization dataset.
Pn-summary is a dataset for Persian abstractive text summarization.
This is a dataset for multi-document summarization in Portuguese, which means that it pairs examples of multiple documents (input) with human-written summaries (output). In particular, each entry consists of multiple related texts from Brazilian websites about a subject, and the summary is the Portuguese Wikipedia lead section on the same subject (the lead is the first section, i.e., the summary, of any Wikipedia article). Input texts were extracted from the BrWaC corpus, and the outputs from Brazilian Wikipedia dump pages.
1 PAPER • NO BENCHMARKS YET
The Gigaword Entailment dataset is a dataset for entailment prediction between an article and its headline. It is built from the Gigaword dataset.
A maintained database tracking ICLR submissions and reviews, augmented with author profiles and higher-level textual features.
Inshorts News dataset: Inshorts is a news service that provides summaries of news from around the web in 60 words or fewer. This dataset contains headlines and summaries of news items along with their sources.
1 PAPER • 1 BENCHMARK
The MLSum-it dataset is the Italian translation (via Helsinki-NLP/opus-mt-es-it) of the Spanish portion of MLSum, containing news articles taken from BBC/mundo.
This dataset was used in the paper 'Template-based Abstractive Microblog Opinion Summarisation' (to be published at TACL, 2022). The data is structured as follows: each file represents a cluster of tweets and contains the tweet IDs along with a summary of the tweets written by journalists. The gold-standard summary follows a template structure and, depending on its opinion content, contains a main story, a majority opinion (if any), and/or minority opinions (if any).
NarraSum is a large-scale narrative summarization dataset. It contains 122K narrative documents, which are collected from plot descriptions of movies and TV episodes with diverse genres, and their corresponding abstractive summaries.
This is a large-scale court judgment dataset, where each judgment is a summary of the case description with a patternized style. It contains 2,003,390 court judgment documents. The case description is used as the input, and the court judgment as the summary. The average lengths of the input documents and summaries are 595.15 words and 273.57 words respectively.
This corpus contains preprocessed posts from the Reddit dataset, suitable for abstractive summarization using deep learning. The format is a JSON file where each line is a JSON object representing a post. The schema of each post is shown below:

- author: string (nullable = true)
- body: string (nullable = true)
- normalizedBody: string (nullable = true)
- content: string (nullable = true)
- content_len: long (nullable = true)
- summary: string (nullable = true)
- summary_len: long (nullable = true)
- id: string (nullable = true)
- subreddit: string (nullable = true)
- subreddit_id: string (nullable = true)
- title: string (nullable = true)
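A minimal parsing sketch for this format, assuming a local copy of the file (the filename below is a placeholder):

```python
import json

# Minimal sketch: iterate over the JSON-lines file of preprocessed Reddit
# posts and pair each post's content with its summary.
pairs = []
with open("reddit_posts.jsonl", encoding="utf-8") as f:  # placeholder filename
    for line in f:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)        # one JSON object per line (schema above)
        content = post.get("content")  # fields are nullable
        summary = post.get("summary")
        if content and summary:
            pairs.append((content, summary))

print(len(pairs), "content/summary pairs")
```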
WikiMulti is a dataset for cross-lingual summarization based on Wikipedia articles in 15 languages.