WikiMatrix is a dataset of parallel sentences mined from the textual content of Wikipedia for all possible language pairs. The mined data spans 85 languages and 1,620 language pairs; a sketch for filtering the released bitext follows below.
84 PAPERS • NO BENCHMARKS YET
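A minimal sketch of consuming the mined pairs, assuming the tab-separated release layout of margin score, source sentence, target sentence; the file name and the 1.04 score cut-off are illustrative assumptions, not part of the entry above.

```python
import csv
import gzip

# Hypothetical local file name and an illustrative margin-score threshold.
PATH = "WikiMatrix.en-fr.tsv.gz"
THRESHOLD = 1.04


def filtered_pairs(path=PATH, threshold=THRESHOLD):
    """Yield (source, target) pairs whose margin score clears the threshold."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != 3:
                continue  # skip malformed lines
            score, src, tgt = float(row[0]), row[1], row[2]
            if score >= threshold:
                yield src, tgt


if __name__ == "__main__":
    for src, tgt in filtered_pairs():
        print(src, "|||", tgt)
        break  # show only the first retained pair
```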
ScanRefer contains 51,583 natural language descriptions of 11,046 objects from 800 ScanNet scenes. It is the first large-scale effort to perform object localization directly in 3D via natural language expressions; a loading sketch follows below.
37 PAPERS • NO BENCHMARKS YET
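A minimal sketch of grouping the descriptions by scene. The file name and the field names ("scene_id", "object_id", "description") are assumptions about the released JSON layout; check them against the actual ScanRefer files.

```python
import json
from collections import defaultdict


def descriptions_per_scene(path="ScanRefer_train.json"):  # hypothetical file name
    """Group (object_id, description) pairs by scene_id."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    by_scene = defaultdict(list)
    for record in records:
        by_scene[record["scene_id"]].append(
            (record["object_id"], record["description"])
        )
    return by_scene


if __name__ == "__main__":
    scenes = descriptions_per_scene()
    print(f"{len(scenes)} scenes loaded")
```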
Paraphrase and Semantic Similarity in Twitter (PIT) provides a constructed Twitter Paraphrase Corpus of 18,762 sentence pairs; a small task sketch follows below.
22 PAPERS • 1 BENCHMARK
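A minimal sketch of the paraphrase-identification decision the corpus supports: scoring a sentence pair by token overlap and thresholding it. The example pair and the threshold are illustrative and not drawn from the corpus itself.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def is_paraphrase(a: str, b: str, threshold: float = 0.5) -> bool:
    """Simple overlap baseline: call the pair a paraphrase above the threshold."""
    return jaccard(a, b) >= threshold


if __name__ == "__main__":
    s1 = "the game was delayed because of the rain"
    s2 = "rain delayed the game"
    print(jaccard(s1, s2), is_paraphrase(s1, s2))
```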
The Discovery dataset consists of adjacent sentence pairs (s1, s2) with a discourse marker (y) that occurred at the beginning of s2. The pairs were extracted from the depcc web corpus; a loading sketch follows below.
9 PAPERS • 1 BENCHMARK
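A minimal sketch of tallying discourse markers from a Discovery-style file. The tab-separated layout (sentence_1, sentence_2, marker) and the file name are assumptions for illustration; adapt them to the released format.

```python
import csv
from collections import Counter


def marker_counts(path="discovery_train.tsv"):  # hypothetical file name
    """Count how often each discourse marker labels a sentence pair."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for s1, s2, marker in reader:
            counts[marker] += 1
    return counts


if __name__ == "__main__":
    for marker, count in marker_counts().most_common(10):
        print(f"{marker}\t{count}")
```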
SemEval 2014 is a collection of datasets used for the Semantic Evaluation (SemEval) workshop, an annual event that focuses on the evaluation and comparison of systems that analyze diverse semantic phenomena in text. The SemEval 2014 datasets cover tasks such as aspect-based sentiment analysis, semantic textual similarity, and sentiment analysis in Twitter.
6 PAPERS • NO BENCHMARKS YET
GeoCoV19 is a large-scale Twitter dataset containing more than 524 million multilingual tweets, of which around 378K are geotagged and 5.4 million carry Place information. The annotations extract toponyms from the user location field and from the tweet content and resolve them to geolocations at the country, state, or city level: 297 million tweets are annotated with a geolocation from the user location field and 452 million from the tweet content. A small filtering sketch follows below.
3 PAPERS • NO BENCHMARKS YET
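A minimal sketch of splitting hydrated tweets into geotagged versus place-tagged, mirroring the two annotation sources described above. It assumes a local JSON-lines file of hydrated tweet objects (such datasets are typically shared as tweet IDs that must be re-hydrated through the Twitter API); the file name is a placeholder.

```python
import json


def count_geo(path="geocov19_hydrated.jsonl"):  # hypothetical file name
    """Count tweets with exact coordinates and tweets with Place metadata."""
    geotagged = with_place = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tweet = json.loads(line)
            if tweet.get("coordinates"):
                geotagged += 1
            if tweet.get("place"):
                with_place += 1
    return geotagged, with_place


if __name__ == "__main__":
    geo, place = count_geo()
    print(f"geotagged: {geo}, with place info: {place}")
```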
AuxAD is a distantly supervised dataset for acronym disambiguation.
1 PAPER • NO BENCHMARKS YET
AuxAI is a distantly supervised dataset for acronym identification.
COSTRA 1.0 is a dataset of complex sentence transformations, intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. The first version is limited to Czech sentences, but the construction method is universal and the authors plan to apply it to other languages as well. The dataset consists of 4,262 unique sentences with an average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation; a small grouping sketch follows below.
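A minimal sketch of grouping COSTRA sentences by transformation type. The tab-separated layout (sentence_id, transformation, sentence) and the file name are illustrative assumptions; adapt them to the released format.

```python
import csv
from collections import defaultdict


def by_transformation(path="costra_1.0.tsv"):  # hypothetical file name
    """Collect sentences under each transformation label."""
    groups = defaultdict(list)
    with open(path, encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for sentence_id, transformation, sentence in reader:
            groups[transformation].append(sentence)
    return groups


if __name__ == "__main__":
    for transformation, sentences in by_transformation().items():
        print(f"{transformation}: {len(sentences)} sentences")
```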
A dataset composed of judgments from the French Court of Cassation and their corresponding summaries.
A distant supervision dataset created by linking the entire English ClueWeb09 corpus to Freebase.