The Newsela dataset was introduced by Xu et al. in their research on text simplification. It is a corpus of thousands of news articles, each professionally rewritten at several levels of reading complexity. It is used in academic research on text difficulty and text simplification, where it often serves as a benchmark, and it is made available to academic partners upon request. Note that the Newsela dataset is different from the NELA datasets, which are collections of news articles for the study of media bias and other applications.
104 PAPERS • 1 BENCHMARK
WikiLarge comprises 359 test sentences, 2,000 development sentences, and 300k training sentences. Each source sentence in the test set has 8 simplified references.
65 PAPERS • NO BENCHMARKS YET
ASSET is a new dataset for assessing sentence simplification in English. ASSET is a crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations.
53 PAPERS • 1 BENCHMARK
TurkCorpus is a dataset of 2,359 original sentences from English Wikipedia, each with 8 manual reference simplifications. The dataset is divided into two subsets for evaluating sentence simplification models: 2,000 sentences for validation and 359 for testing.
43 PAPERS • 1 BENCHMARK
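The multi-reference layout described for TurkCorpus can be sketched as a small data structure. This is only an illustration: the class name `MultiReferenceExample`, its fields, and the helper functions are hypothetical and do not reflect the dataset's actual file format. Only the three sizes come from the description above.

```python
from dataclasses import dataclass, field
from typing import List

# Sizes stated in the TurkCorpus description.
VALIDATION_SIZE = 2000
TEST_SIZE = 359
REFERENCES_PER_SENTENCE = 8


@dataclass
class MultiReferenceExample:
    """One original sentence with its reference simplifications (hypothetical layout)."""
    original: str
    references: List[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        # An example is complete when it carries all 8 references.
        return len(self.references) == REFERENCES_PER_SENTENCE


def total_sentences() -> int:
    """Validation and test subsets together."""
    return VALIDATION_SIZE + TEST_SIZE


def total_references() -> int:
    """Number of reference simplifications across both subsets."""
    return total_sentences() * REFERENCES_PER_SENTENCE


if __name__ == "__main__":
    # Placeholder strings only; no real corpus content is reproduced here.
    example = MultiReferenceExample(
        original="<original sentence>",
        references=[f"<reference {i}>" for i in range(1, REFERENCES_PER_SENTENCE + 1)],
    )
    print(example.is_complete())
    print(total_sentences(), total_references())
```

The two subsets add up to the 2,359 sentences quoted in the description, which implies 18,872 reference simplifications in total.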
The corpus is useful for two applications: automatic readability assessment and automatic text simplification. It consists of 189 texts, each in three versions (567 in total).
27 PAPERS • NO BENCHMARKS YET
The dataset contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan et al. (2017) as a benchmark for this task.
21 PAPERS • 1 BENCHMARK
TextComplexityDE is a dataset of 1,000 German sentences taken from 23 Wikipedia articles in 3 different article genres, intended for developing text-complexity prediction models and automatic text simplification for German. The dataset includes subjective assessments of different text-complexity aspects provided by learners of German at levels A and B. In addition, it contains manual simplifications of 250 of those sentences provided by native speakers, together with subjective assessments of the simplified sentences by participants from the target group. The subjective ratings were collected through both laboratory studies and crowdsourcing.
16 PAPERS • 1 BENCHMARK
The dataset introduces document alignments between German Wikipedia and the children's lexicon Klexikon. The source texts in Wikipedia are written in more complex language than those in Klexikon and are also significantly longer, which makes the dataset suitable for both summarization and simplification. Previous research has so far focused on only one of the two tasks; they have not been studied comprehensively as a joint task.
4 PAPERS • 1 BENCHMARK
CEFR-SP contains 17k English sentences annotated with levels based on the Common European Framework of Reference for Languages (CEFR). The levels were assigned by English-education professionals.
3 PAPERS • NO BENCHMARKS YET
This dataset contains around 5,000 scholarly articles and their corresponding easy summaries from the Eureka Alert blog. The dataset can be used for the combined task of summarization and simplification.
3 PAPERS • 1 BENCHMARK
Med-EASi (Medical dataset for Elaborative and Abstractive Simplification) is a uniquely crowdsourced and finely annotated dataset for supervised simplification of short medical texts. It contains 1,979 expert-simple text pairs in the medical domain, spanning a total of 4,478 UMLS concepts across all text pairs. The dataset is annotated with four textual transformations: replacement, elaboration, insertion, and deletion.
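The four transformation labels named for Med-EASi lend themselves to a small typed representation. The sketch below is a minimal illustration under stated assumptions: `Transformation`, `AnnotatedPair`, and `count_transformations` are hypothetical names, the real annotation format is not described here, and the sample pair is an invented toy example rather than data from the corpus.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class Transformation(Enum):
    """The four textual transformations named in the description."""
    REPLACEMENT = "replacement"
    ELABORATION = "elaboration"
    INSERTION = "insertion"
    DELETION = "deletion"


@dataclass
class AnnotatedPair:
    """An expert text, its simple counterpart, and the labels applied (hypothetical layout)."""
    expert: str
    simple: str
    transformations: List[Transformation]


def count_transformations(pairs: Iterable[AnnotatedPair]) -> Counter:
    """Count how often each transformation label occurs across the pairs."""
    counts: Counter = Counter()
    for pair in pairs:
        counts.update(pair.transformations)
    return counts


if __name__ == "__main__":
    # Invented toy pair for illustration only; not taken from Med-EASi.
    toy = AnnotatedPair(
        expert="The patient presented with pyrexia.",
        simple="The patient had a fever.",
        transformations=[Transformation.REPLACEMENT],
    )
    print(count_transformations([toy]))
```

Using an enum rather than free-form strings keeps the label set closed, so a misspelled label fails at construction time instead of silently creating a fifth category.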
DEplain-APA-sent: A German Parallel Corpus for Sentence Simplification on News Texts. DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German “plain DE” (or in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.
2 PAPERS • 1 BENCHMARK
DEplain-web-sent: A German Parallel Corpus for Sentence Simplification on Web Texts. DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German “plain DE” (or in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.
TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.
2 PAPERS • NO BENCHMARKS YET
DEplain-APA-doc: A German Parallel Corpus for Document Simplification on News Texts. DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German “plain DE” (or in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.
1 PAPER • 1 BENCHMARK
DEplain-web-doc: A German Parallel Corpus for Document Simplification on Web Texts. DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German “plain DE” (or in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.
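The four DEplain subcorpus names follow a regular `DEplain-<source>-<level>` pattern, which can be decoded mechanically. The following is a minimal sketch assuming that pattern holds for all four names; the function `describe_subcorpus` and the two lookup tables are hypothetical helpers, with the meanings of each name component taken from the entry titles above.

```python
from typing import Tuple

# Meanings read off the four entry titles (news vs. web, document vs. sentence).
SOURCES = {"APA": "news texts", "web": "web texts"}
LEVELS = {"doc": "document simplification", "sent": "sentence simplification"}

SUBCORPORA = (
    "DEplain-APA-doc",
    "DEplain-APA-sent",
    "DEplain-web-doc",
    "DEplain-web-sent",
)


def describe_subcorpus(name: str) -> Tuple[str, str]:
    """Split a subcorpus name into (text source, simplification level)."""
    parts = name.split("-")
    if len(parts) != 3 or parts[0] != "DEplain":
        raise ValueError(f"not a DEplain subcorpus name: {name!r}")
    _, source, level = parts
    if source not in SOURCES or level not in LEVELS:
        raise ValueError(f"unknown source or level in: {name!r}")
    return SOURCES[source], LEVELS[level]


if __name__ == "__main__":
    for name in SUBCORPORA:
        print(name, describe_subcorpus(name))
```

Decoding the name this way makes it clear that the four subcorpora are the full cross product of two text sources and two alignment levels.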
The goal of InfoLossQA is to generate a series of QA pairs that reveal to lay readers what information a simplified text lacks compared to its original.
1 PAPER • NO BENCHMARKS YET
Multiword expressions (MWEs) represent lexemes that should be treated as single lexical units due to their idiosyncratic nature. MWE-CWI is a dataset for MWE detection based on the Complex Word Identification Shared Task 2018 dataset.
A medical Wiki parallel corpus for medical text simplification.
SimpEvalASSET is a dataset for training learnable metrics using modern language models. It comprises 12K human ratings on 2.4K simplifications of 24 systems. It is accompanied by SIMPEVAL_2022, a challenging simplification benchmark consisting of over 1K human ratings of 360 simplifications, including generations from GPT-3.5.