The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb), each labeled as positive or negative. The dataset contains equal numbers of positive and negative reviews. Only highly polarized reviews are included: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset also contains additional unlabeled data.
1,577 PAPERS • 11 BENCHMARKS
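As a minimal sketch (not part of the original description), the labeled splits and the additional unlabeled data can be loaded through the Hugging Face datasets library, assuming the Hub id `imdb`:

```python
# Minimal sketch: loading IMDb reviews, assuming the Hugging Face Hub id "imdb".
from datasets import load_dataset

imdb = load_dataset("imdb")                              # labeled "train" and "test" splits
unlabeled = load_dataset("imdb", split="unsupervised")   # the additional unlabeled reviews

example = imdb["train"][0]
print(example["text"][:200])   # review text
print(example["label"])        # 0 = negative, 1 = positive
```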
PAWS-X contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.
160 PAPERS • 2 BENCHMARKS
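A hedged loading sketch, assuming the Hub id `paws-x` with per-language configurations such as `fr`:

```python
# Minimal sketch: loading the French portion of PAWS-X.
# The dataset id "paws-x" and the field names are assumptions about the Hub hosting.
from datasets import load_dataset

pawsx_fr = load_dataset("paws-x", "fr")
pair = pawsx_fr["validation"][0]
print(pair["sentence1"])
print(pair["sentence2"])
print(pair["label"])   # 1 = paraphrase, 0 = not a paraphrase
```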
Paraphrase Adversaries from Word Scrambling (PAWS) is a dataset of 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order for paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other based on the Quora Question Pairs (QQP) dataset.
144 PAPERS • NO BENCHMARKS YET
The Yelp Dataset is a collection of real-world data on businesses, reviews, and user interactions, intended for academic research, teaching, and learning. It includes 6,990,280 reviews of 150,346 businesses across 11 metropolitan areas, 200,100 pictures, 908,915 tips from 1,987,897 users, over 1.2 million business attributes (such as hours, parking availability, and ambience), and aggregated historical check-in data for 131,930 businesses.
68 PAPERS • 21 BENCHMARKS
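The dataset is distributed as newline-delimited JSON files; the sketch below streams the review file, with the file name being an assumption about the downloaded archive:

```python
# Minimal sketch: iterating over Yelp reviews from the newline-delimited JSON dump.
# The file name "yelp_academic_dataset_review.json" is an assumption about the download.
import json

def iter_reviews(path="yelp_academic_dataset_review.json"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            review = json.loads(line)
            yield review["business_id"], review["stars"], review["text"]

for business_id, stars, text in iter_reviews():
    print(business_id, stars, text[:80])
    break   # show a single record
```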
WikiHop is a multi-hop question-answering dataset. Queries are constructed from entities and relations in WikiData, while the supporting documents come from WikiReading. A bipartite graph connecting entities and documents is built first, and the answer for each query is located by traversing this graph. Candidates that are type-consistent with the answer and linked by the same relation as in the query are included, resulting in a set of answer candidates; WikiHop is therefore a multiple-choice reading comprehension dataset. In total there are about 43K samples in the training set, 5K in the development set, and 2.5K in the test set; the test set is not publicly released. The task is to predict the correct answer given a query and multiple supporting documents.
67 PAPERS • 2 BENCHMARKS
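A sketch of the record structure, assuming the commonly distributed QAngaroo-style JSON format with `query`, `supports`, `candidates`, and `answer` fields (the local path is hypothetical):

```python
# Minimal sketch: inspecting one WikiHop sample from a QAngaroo-style JSON release.
# The path and field names are assumptions about the downloaded files.
import json

with open("qangaroo_v1.1/wikihop/train.json", encoding="utf-8") as f:
    samples = json.load(f)

sample = samples[0]
print(sample["query"])          # relation plus subject entity
print(len(sample["supports"]))  # number of supporting documents
print(sample["candidates"])     # type-consistent answer candidates
print(sample["answer"])         # gold answer (train/dev only; the test set is withheld)
```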
The Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, each annotated with a binary value indicating whether the two questions are paraphrases of each other.
53 PAPERS • 8 BENCHMARKS
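A minimal loading sketch, assuming QQP is accessed through the GLUE collection on the Hugging Face Hub:

```python
# Minimal sketch: loading QQP via the GLUE collection (dataset id "glue", config "qqp").
from datasets import load_dataset

qqp = load_dataset("glue", "qqp")
pair = qqp["train"][0]
print(pair["question1"])
print(pair["question2"])
print(pair["label"])   # 1 = duplicate (paraphrase), 0 = not a duplicate
```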
Paraphrase and Semantic Similarity in Twitter (PIT) provides the Twitter Paraphrase Corpus, which contains 18,762 sentence pairs.
22 PAPERS • 1 BENCHMARK
The Paralex corpus is a collection of 18 million question-paraphrase pairs scraped from WikiAnswers.
20 PAPERS • 1 BENCHMARK
FLUE is a French Language Understanding Evaluation benchmark. It consists of 5 tasks: Text Classification, Paraphrasing, Natural Language Inference, Constituency Parsing and Part-of-Speech Tagging, and Word Sense Disambiguation.
12 PAPERS • NO BENCHMARKS YET
The Twitter News URL Corpus is the largest human-labeled paraphrase corpus to date, containing 51,524 sentence pairs, and provides the first cross-domain benchmark for automatic paraphrase identification.
6 PAPERS • 1 BENCHMARK
The Finnish Paraphrase Corpus is a fully manually annotated paraphrase corpus for Finnish, containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Of all paraphrase pairs in the corpus, 98% are manually classified as paraphrases at least in their given context, if not in all contexts.
3 PAPERS • NO BENCHMARKS YET
PARADE contains paraphrases that overlap very little at the lexical and syntactic level but are semantically equivalent based on computer science domain knowledge, as well as non-paraphrases that overlap greatly at the lexical and syntactic level but are not semantically equivalent based on this domain knowledge.
SV-Ident comprises 4,248 sentences from social science publications in English and German. It is the official dataset for the 2022 shared task "Survey Variable Identification in Social Science Publications" (SV-Ident). Sentences are labeled with variables that are mentioned either explicitly or implicitly.
3 PAPERS • 2 BENCHMARKS
This is a paraphrasing dataset created using the adversarial paradigm. In the Adversarial Paraphrasing Task (APT), the objective is to write sentences that mean the same as a given sentence but differ from it as much as possible in syntactic and lexical properties.
1 PAPER • 1 BENCHMARK
This is a benchmark for neural paraphrase detection that aims to differentiate between original and machine-generated content.
1 PAPER • NO BENCHMARKS YET
This dataset is used to train and evaluate models for the detection of machine-paraphrased text. For more details see https://huggingface.co/datasets/jpwahle/autoregressive-paraphrase-dataset
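A minimal sketch of pulling the dataset from the Hub repository linked above; split and column names are not documented here, so the code inspects them rather than assuming them:

```python
# Minimal sketch: loading the dataset from the Hub repo referenced above and
# inspecting its splits and columns instead of assuming their names.
from datasets import load_dataset

ds = load_dataset("jpwahle/autoregressive-paraphrase-dataset")
print(ds)                       # available splits and their columns
first_split = next(iter(ds))
print(ds[first_split][0])       # one example record
```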