ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected from many sources, including expert-created resources, crowd-sourcing, and games with a purpose. It is designed to represent the general knowledge involved in understanding language, helping applications better understand the meanings behind the words people use.
799 PAPERS • NO BENCHMARKS YET
FrameNet is a linguistic knowledge graph containing information about lexical and predicate argument semantics of the English language. FrameNet contains two distinct entity classes: frames and lexical units, where a frame is a meaning and a lexical unit is a single meaning for a word.
434 PAPERS • NO BENCHMARKS YET
BookCorpus is a large collection of free novels written by unpublished authors, containing 11,038 books (around 74M sentences and roughly 1B words) across 16 sub-genres (e.g., Romance, Historical, Adventure).
322 PAPERS • 1 BENCHMARK
WiC is a benchmark for the evaluation of context-sensitive word embeddings, framed as a binary classification task. Each instance in WiC has a target word w, either a verb or a noun, for which two contexts are provided, each triggering a specific meaning of w. The task is to identify whether the occurrences of w in the two contexts correspond to the same meaning or not. The dataset can also be viewed as an application of Word Sense Disambiguation in practice.
168 PAPERS • NO BENCHMARKS YET
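As a rough illustration of the WiC task structure, here is a minimal sketch of an instance and a naive same-sense decision. The `WiCInstance` fields mirror the description above; the `same_sense` helper, its 0.5 cosine threshold, and the idea of comparing contextual vectors of the target word are illustrative assumptions, not the benchmark's official baseline.

```python
# Hypothetical sketch of a WiC instance and a cosine-threshold baseline.
from dataclasses import dataclass

import numpy as np


@dataclass
class WiCInstance:
    target: str    # the target word w (verb or noun)
    context1: str  # first context containing w
    context2: str  # second context containing w
    label: bool    # True if w carries the same meaning in both contexts


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def same_sense(vec1: np.ndarray, vec2: np.ndarray, threshold: float = 0.5) -> bool:
    # Predict "same meaning" when the contextual vectors of w (produced by
    # any encoder of your choice) are close enough; the threshold is made up.
    return cosine(vec1, vec2) >= threshold
```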
BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), and the relevant contexts (C) (also called snippets).
163 PAPERS • 2 BENCHMARKS
The One Billion Word dataset is a dataset for language modeling. The training/held-out data was produced from the WMT 2011 News Crawl data using a combination of Bash shell and Perl scripts.
134 PAPERS • 2 BENCHMARKS
WinoBias contains 3,160 sentences, split equally for development and test, created by researchers familiar with the project. Sentences were created to follow two prototypical templates but annotators were encouraged to come up with scenarios where entities could be interacting in plausible ways. Templates were selected to be challenging and designed to cover cases requiring semantics and syntax separately.
108 PAPERS • NO BENCHMARKS YET
PadChest is a labeled large-scale, high-resolution chest X-ray dataset for the automated exploration of medical images along with their associated reports. It includes more than 160,000 images obtained from 67,000 patients, interpreted and reported by radiologists at Hospital San Juan (Spain) from 2009 to 2017, covering six different position views plus additional information on image acquisition and patient demographics. The reports were labeled with 174 different radiographic findings, 19 differential diagnoses, and 104 anatomic locations, organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. Of these reports, 27% were manually annotated by trained physicians; the rest were labeled using a supervised method based on a recurrent neural network with attention mechanisms, whose generated labels were validated on an independent test set, achieving a 0.93 micro-F1 score.
84 PAPERS • NO BENCHMARKS YET
WikiMatrix is a dataset of parallel sentences mined from the textual content of Wikipedia for all possible language pairs. The mined data consists of 135M parallel sentences in 1,620 language pairs.
The CELEX database comprises three searchable lexical databases: Dutch, English, and German. The lexical data in each database is divided into five categories: orthography, phonology, morphology, syntax (word class), and word frequency.
57 PAPERS • NO BENCHMARKS YET
WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format. The dataset serves as a valuable resource for training and evaluating named entity recognition models across various languages.
57 PAPERS • 3 BENCHMARKS
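Since the entry above references the IOB2 format, the sketch below shows how token-level IOB2 tags decode into entity spans. The tagging convention (B- opens an entity, I- continues it, O is outside) is standard; the function name and toy sentence are made up for illustration.

```python
# Decode IOB2 token tags into (entity_text, entity_type) spans.
def iob2_to_spans(tokens, tags):
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(token)
        else:  # "O" or an inconsistent I- tag closes any open entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


tokens = ["Ada", "Lovelace", "was", "born", "in", "London"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(iob2_to_spans(tokens, tags))  # [('Ada Lovelace', 'PER'), ('London', 'LOC')]
```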
Over a period of many years during the 1990s, a large group of psychologists around the world collected data in the ISEAR project, directed by Klaus R. Scherer and Harald Wallbott. Student respondents, both psychologists and non-psychologists, were asked to report situations in which they had experienced each of seven major emotions (joy, fear, anger, sadness, disgust, shame, and guilt). In each case, the questions covered how they had appraised the situation and how they reacted. The final dataset thus contains reports on the seven emotions from close to 3,000 respondents in 37 countries across all five continents.
49 PAPERS • NO BENCHMARKS YET
PanLex translates words in thousands of languages. Its database is panlingual (emphasizes coverage of every language) and lexical (focuses on words, not sentences).
34 PAPERS • NO BENCHMARKS YET
The EVALution dataset is evenly distributed among three classes (hypernyms, co-hyponyms, and random) and involves three parts of speech (noun, verb, adjective). The full dataset contains a total of 4,263 distinct terms, consisting of 2,380 nouns, 958 verbs, and 972 adjectives.
27 PAPERS • NO BENCHMARKS YET
There are now many computer programs for automatically determining the sense of a word in context (Word Sense Disambiguation or WSD). The purpose of SENSEVAL is to evaluate the strengths and weaknesses of such programs with respect to different words, different varieties of language, and different languages.
19 PAPERS • NO BENCHMARKS YET
The first parallel corpus composed of United Nations documents published by the original data creator. It consists of manually translated UN documents from the last 25 years (1990 to 2014) in the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish.
17 PAPERS • NO BENCHMARKS YET
An expert-annotated word similarity dataset which provides a highly reliable, yet challenging, benchmark for rare word representation techniques.
10 PAPERS • NO BENCHMARKS YET
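Word-similarity benchmarks of this kind are conventionally scored by rank-correlating model similarities with the human ratings (Spearman's rho). A minimal sketch, assuming toy embeddings and made-up rated pairs rather than the actual dataset:

```python
import numpy as np
from scipy.stats import spearmanr

# Toy stand-ins for real word vectors and human-rated pairs.
embeddings = {
    "cat": np.array([0.9, 0.1]),
    "feline": np.array([0.8, 0.2]),
    "car": np.array([0.1, 0.9]),
}
rated_pairs = [  # (word1, word2, human similarity score)
    ("cat", "feline", 9.2),
    ("cat", "car", 1.5),
    ("feline", "car", 1.8),
]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in rated_pairs]
human_scores = [score for _, _, score in rated_pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman rho: {rho:.3f}")
```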
The IndoSum dataset is a benchmark dataset for Indonesian text summarization. The dataset consists of news articles and manually constructed summaries.
9 PAPERS • NO BENCHMARKS YET
Polyglot-NER is a massive multilingual named entity recognition dataset, built with annotators that require minimal human expertise and intervention.
SEND (Stanford Emotional Narratives Dataset) is a set of rich, multimodal videos of self-paced, unscripted emotional narratives, annotated for emotional valence over time. The complex narratives and naturalistic expressions in this dataset provide a challenging test for contemporary time-series emotion recognition models.
8 PAPERS • NO BENCHMARKS YET
A new challenging dataset that can be used for many pattern recognition tasks.
7 PAPERS • 2 BENCHMARKS
Introduces three datasets, containing expressions of hate, commonly used topics, and opinions, for hate speech detection, document classification, and sentiment analysis, respectively.
6 PAPERS • NO BENCHMARKS YET
SemEval 2014 is a collection of datasets used for the Semantic Evaluation (SemEval) workshop, an annual event that focuses on the evaluation and comparison of systems that can analyze diverse semantic phenomena in text. The datasets from SemEval 2014 are used for a variety of tasks, including aspect-based sentiment analysis and sentence-level semantic relatedness.
The WordNet Language Model Probing (WNLaMPro) dataset consists of relations between keywords and words. It contains 4 different kinds of relations: Antonym, Hypernym, Cohyponym and Corruption.
WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.
5 PAPERS • NO BENCHMARKS YET
Consists of 2,864 videos, each with a label from 25 different classes corresponding to an event unfolding over 5 seconds. The ERA dataset is designed to have significant intra-class variation and inter-class similarity, and captures dynamic events in various circumstances and at dramatically different scales.
4 PAPERS • NO BENCHMARKS YET
A large-scale evaluation set that provides human ratings for the plausibility of 10,000 selectional preference (SP) pairs over five SP relations, covering the 2,500 most frequent verbs, nouns, and adjectives in American English.
Some Like it Hoax is a fake news detection dataset consisting of 15,500 Facebook posts and 909,236 users.
The WikiSem500 dataset contains around 500 per-language cluster groups for English, Spanish, German, Chinese, and Japanese (a total of 13,314 test cases).
AnlamVer is a semantic model evaluation dataset for Turkish, designed to evaluate word similarity and word relatedness while discriminating those two relations from each other. The dataset consists of 500 word pairs annotated by 12 human subjects, each pair with two distinct scores for similarity and relatedness. Word pairs are selected to enable the evaluation of distributional semantic models along multiple attributes of words and word-pair relations, such as frequency, morphology, concreteness, and relation type (e.g., synonymy, antonymy), providing insights to semantic model researchers across these attributes. The word pairs are balanced by frequency to evaluate the robustness of semantic models to out-of-vocabulary and rare words, which arise from the rich derivational and inflectional morphology of the Turkish language.
3 PAPERS • NO BENCHMARKS YET
The dataset contains the main components of the news articles published online by the newspaper Gazzetta di Modena (https://gazzettadimodena.gelocal.it/modena): URL of the web page, title, subtitle, text, date of publication, and the crime category assigned to each news article by the author.
The dataset consists of the features associated with 402 five-second sound samples. The 402 sounds range from easily identifiable everyday sounds to intentionally obscured artificial ones. The dataset aims to lower the barrier to the study of aural phenomenology as the largest available audio dataset to include an analysis of causal attribution. Each sample is annotated with crowd-sourced descriptions as well as familiarity, imageability, arousal, and valence ratings.
The IndicNLP corpus is a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.
Contains more than 6500 words semantically grouped under 110 categories.
The Chinese Gigaword corpus consists of 2.2M headline-document pairs of news stories covering 284 months from two Chinese news sources, namely the Xinhua News Agency of China (XIN) and the Central News Agency of Taiwan (CNA).
2 PAPERS • NO BENCHMARKS YET
The Climate Change Claims dataset for generating fact-checking summaries contains claims broadly related to climate change and global warming from climatefeedback.org. It contains 1k documents covering 104 different claims from 97 different domains.
This dataset contains information about Japanese word similarity, including rare words. It was constructed following the Stanford Rare Word Similarity Dataset. Ten annotators rated word pairs on an 11-level similarity scale.
The MUSE dataset contains bilingual dictionaries for 110 pairs of languages. For each language pair, the training seed dictionaries contain approximately 5000 word pairs while the evaluation sets contain 1500 word pairs.
2 PAPERS • 2 BENCHMARKS
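A common way to use MUSE-style evaluation dictionaries is precision@1 for word translation: retrieve the nearest target-language embedding for each source word and compare it to the gold entry. The sketch below assumes toy, already-aligned embedding spaces; the words and vectors are placeholders.

```python
import numpy as np

# Toy aligned embeddings and a tiny gold dictionary (French -> English).
src = {"chat": np.array([1.0, 0.0]), "chien": np.array([0.0, 1.0])}
tgt = {"cat": np.array([0.9, 0.1]), "dog": np.array([0.1, 0.9])}
gold = {"chat": "cat", "chien": "dog"}

tgt_words = list(tgt)
tgt_matrix = np.stack([tgt[w] / np.linalg.norm(tgt[w]) for w in tgt_words])

hits = 0
for word, translation in gold.items():
    query = src[word] / np.linalg.norm(src[word])
    nearest = tgt_words[int(np.argmax(tgt_matrix @ query))]  # cosine NN
    hits += nearest == translation
print(f"precision@1 = {hits / len(gold):.2f}")
```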
Collects a huge number of job descriptions from Dice.com, one of the most popular career websites for tech jobs in the USA. Skills are extracted from each job description using a skills dictionary, so the dataset is presented as a list of skill collections, where skills in the same collection share the context of a single job description. In total, the crawl yielded 5GB of data with more than 1,400,000 job descriptions.
The pioNER corpus provides gold-standard and automatically generated named-entity datasets for the Armenian language. The automatically generated corpus is generated from Wikipedia. The gold-standard set is a collection of over 250 news articles from iLur.am with manual named-entity annotation. It includes sentences from political, sports, local and world news, and is comparable in size with the test sets of other languages.
A Mikolov-style word-analogy evaluation set specifically for Bangla, with a sample size of 16,678, together with a translated and curated version of the Mikolov dataset containing 10,594 samples for cross-lingual research.
1 PAPER • NO BENCHMARKS YET
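Mikolov-style analogy sets like this one are typically solved with the vector-offset (3CosAdd) method: for a question A:B::C:?, return the vocabulary word whose vector is closest to b - a + c, excluding the question words themselves. A minimal sketch with toy vectors standing in for trained embeddings:

```python
import numpy as np

# Toy vectors; a real evaluation would load trained word embeddings.
vectors = {
    "king": np.array([0.8, 0.7]),
    "man": np.array([0.6, 0.2]),
    "woman": np.array([0.2, 0.6]),
    "queen": np.array([0.4, 1.1]),
}

def solve_analogy(a, b, c):
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in (a, b, c):  # question words are excluded by convention
            continue
        sim = float(vec @ target / np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(solve_analogy("man", "king", "woman"))  # "queen" with these toy vectors
```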
Collects posts (and their reactions) from Facebook pages of large supermarket chains.
A distant supervision dataset by linking the entire English ClueWeb09 corpus to Freebase.
A dataset of sentence pairs annotated following the formalization.
A new word analogy task dataset for Indonesian.
Word embedding is a modern distributed word representation approach widely used in many natural language processing tasks. Converting the vocabulary of a legal document into a word embedding model makes legal documents amenable to machine learning, deep learning, and other algorithms, and subsequently to downstream NLP tasks such as document classification, contract review, and machine translation. The most common practical approach to evaluating the accuracy of a word embedding model uses a benchmark set with linguistic rules or relationships between words to perform analogy reasoning via algebraic calculation. LARQS (Legal Analogical Reasoning Questions Set) comprises 1,134 such questions built from a 2,388-document Chinese Codex corpus using five kinds of legal relations, and is used to evaluate the accuracy of Chinese word embedding models. The authors also observe that legal relations may be ubiquitous in legal texts.
The McQueen dataset contains 15k visual conversations and over 80k queries, each associated with a fully specified rewrite. In addition, for entities appearing in the rewrites, the corresponding image box annotations are provided.
SART is a collection of three datasets covering Similarity, Analogies, and Relatedness for the Tatar language. The three subsets are:

* Similarity dataset: 202 word pairs with averaged human scores for the degree of similarity between the words (on a 0-to-10 scale), e.g. "йорт, бина, 7.69".
* Relatedness dataset: 252 word pairs with averaged human scores for the degree of relatedness between the words, e.g. "урам, балалар, 5.38".
* Analogies dataset: analogy questions of the form A:B::C:D, meaning A is to B as C is to D, where D is to be predicted, e.g. "Әнкара Төркия Париж Франция". Contains 34 categories and 30,144 questions in total.