The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. Big-bench include more than 200 tasks.
222 PAPERS • 134 BENCHMARKS
WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each instance in WiC has a target word w, either a verb or a noun, for which two contexts are provided. Each of these contexts triggers a specific meaning of w. The task is to identify if the occurrences of w in the two contexts correspond to the same meaning or not. In fact, the dataset can also be viewed as an application of Word Sense Disambiguation in practise.
168 PAPERS • NO BENCHMARKS YET
The Evaluation framework of Raganato et al. 2017 includes two training sets (SemCor-Miller et al., 1993- and OMSTI-Taghipour and Ng, 2015-) and five test sets from the Senseval/SemEval series (Edmonds and Cotton, 2001; Snyder and Palmer, 2004; Pradhan et al., 2007; Navigli et al., 2013; Moro and Navigli, 2015), standardized to the same format and sense inventory (i.e. WordNet 3.0).
107 PAPERS • 3 BENCHMARKS
There are now many computer programs for automatically determining the sense of a word in context (Word Sense Disambiguation or WSD). The purpose of SENSEVAL is to evaluate the strengths and weaknesses of such programs with respect to different words, different varieties of language, and different languages.
19 PAPERS • NO BENCHMARKS YET
FLUE is a French Language Understanding Evaluation benchmark. It consists of 5 tasks: Text Classification, Paraphrasing, Natural Language Inference, Constituency Parsing and Part-of-Speech Tagging, and Word Sense Disambiguation.
12 PAPERS • NO BENCHMARKS YET
WiC-TSV is a new multi-domain evaluation benchmark for Word Sense Disambiguation. More specifically, it is a framework for Target Sense Verification of Words in Context which grounds its uniqueness in the formulation as a binary classification task thus being independent of external sense inventories, and the coverage of various domains. This makes the dataset highly flexible for the evaluation of a diverse set of models and systems in and across domains. WiC-TSV provides three different evaluation settings, depending on the input signals provided to the model.
9 PAPERS • 2 BENCHMARKS
WiC: The Word-in-Context Dataset A reliable benchmark for the evaluation of context-sensitive word embeddings.
7 PAPERS • 1 BENCHMARK
SubjQA is a question answering dataset that focuses on subjective (as opposed to factual) questions and answers. The dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery, electronics, TripAdvisor (i.e. hotels), and restaurants. Each question is paired with a review and a span is highlighted as the answer to the question (with some questions having no answer). Moreover, both questions and answer spans are assigned a subjectivity label by annotators. Questions such as "How much does this product weigh?" is a factual question (i.e., low subjectivity), while "Is this easy to use?" is a subjective question (i.e., high subjectivity).
7 PAPERS • NO BENCHMARKS YET
The CoarseWSD-20 dataset is a coarse-grained sense disambiguation dataset built from Wikipedia (nouns only) targeting 2 to 5 senses of 20 ambiguous words. It was specifically designed to provide an ideal setting for evaluating Word Sense Disambiguation (WSD) models (e.g. no senses in test sets missing from training), both quantitively and qualitatively.
6 PAPERS • NO BENCHMARKS YET
Verse is a new dataset that augments existing multimodal datasets (COCO and TUHOI) with sense labels.
4 PAPERS • NO BENCHMARKS YET
Intended to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies.
3 PAPERS • NO BENCHMARKS YET
The Parallel Meaning Bank (PMB), developed at the University of Groningen and building upon the Groningen Meaning Bank, comprises sentences and texts in raw and tokenised format, syntactic analysis, word senses, thematic roles, reference resolution, and formal meaning representations. The main objective of the PMB is to provide fine-grained meaning representations for words, sentences and texts. Sentences are, in isolation, often ambiguous. The aim is to provide the most likely interpretation for a sentence, with a minimal use of underspecification.
2 PAPERS • NO BENCHMARKS YET
ReviewQA is a question-answering dataset based on hotel reviews. The questions of this dataset are linked to a set of relational understanding competencies that a model is expected to master. Indeed, each question comes with an associated type that characterizes the required competency.
The Part-Whole Relations dataset is a dataset of semantic relations between entities. It contains the following subtypes: - Component-Of - Member-Of - Portion-Of - Stuff-Of - Located-In - Contained-In - Phase-Of - Participates-In
1 PAPER • NO BENCHMARKS YET
SBU-WSD-Corpus is a corpus for Persian Word Sense Disambiguation (WSD). It is manually annotated with senses from the Persian WordNet (FarsNet) sense inventory. SBU-WSD-Corpus consists of 19 Persian documents in different domains such as Sports, Science, Arts, etc. It includes 5892 content words of Persian running text and 3371 manually sense annotated words (2073 nouns, 566 verbs, 610 adjectives, and 122 adverbs).