The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
10,147 PAPERS • 92 BENCHMARKS
Click to add a brief description of the dataset (Markdown and LaTeX enabled).
55 PAPERS • NO BENCHMARKS YET
Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary value indicating whether the two questions are paraphrase of each other.
53 PAPERS • 8 BENCHMARKS
Paralex learns from a collection of 18 million question-paraphrase pairs scraped from WikiAnswers.
20 PAPERS • 1 BENCHMARK
A large-scale English paraphrase dataset that surpasses prior work in both quantity and quality.
17 PAPERS • NO BENCHMARKS YET
Opusparcus is a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows.
15 PAPERS • NO BENCHMARKS YET
PARANMT-50M is a dataset for training paraphrastic sentence embeddings. It consists of more than 50 million English-English sentential paraphrase pairs.
12 PAPERS • NO BENCHMARKS YET
This is the dataset for the 2020 Duolingo shared task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). Sentence prompts, along with automatic translations, and high-coverage sets of translation paraphrases weighted by user response are provided in 5 language pairs. Starter code for this task can be found here: github.com/duolingo/duolingo-sharedtask-2020/. More details on the data set and task are available at: sharedtask.duolingo.com
10 PAPERS • NO BENCHMARKS YET
OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and the corpus is also delivered as an open content package. We used several tools to compile the current collection. All pre-processing is done automatically. No manual corrections have been carried out.
9 PAPERS • NO BENCHMARKS YET
Aims to help V-NLIs recognize analytic tasks from free-form natural language by training and evaluating cutting-edge multi-label classification models. The dataset contains diverse user queries, and each is annotated with one or multiple analytic tasks.
2 PAPERS • NO BENCHMARKS YET
StyleKQC is a style-variant paraphrase corpus for korean questions and commands. It was built with a corpus construction scheme that simultaneously considers the core content and style of directives, namely intent and formality, for the Korean language. Utilizing manually generated natural language queries on six daily topics, the corpus was expanded to formal and informal sentences by human rewriting and transferring.
TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.
This is a paraphrasing dataset created using the adversarial paradigm. A task was designed called the Adversarial Paraphrasing Task (APT) whose objective was to write sentences that mean the same as a given sentence but have as different syntactical and lexical properties as possible.
1 PAPER • 1 BENCHMARK
For more details see https://huggingface.co/datasets/jpwahle/autoregressive-paraphrase-dataset
1 PAPER • NO BENCHMARKS YET
We introduce a new task of rephrasing for amore natural virtual assistant. Currently, vir-tual assistants work in the paradigm of intent-slot tagging and the slot values are directlypassed as-is to the execution engine. However,this setup fails in some scenarios such as mes-saging when the query given by the user needsto be changed before repeating it or sending itto another user. For example, for queries like‘ask my wife if she can pick up the kids’ or ‘re-mind me to take my pills’, we need to rephrasethe content to ‘can you pick up the kids’ and‘take your pills’. In this paper, we study theproblem of rephrasing with messaging as ause case and release a dataset of 3000 pairs oforiginal query and rephrased query. We showthat BART, a pre-trained transformers-basedmasked language model with auto-regressivedecoding, is a strong baseline for the task, andshow improvements by adding a copy-pointerand copy loss to it. We analyze different trade-offs of BART-based and LSTM-based seq2seqmodels
This is a dataset of paraphrases created by ChatGPT.
0 PAPER • NO BENCHMARKS YET