CommonsenseQA is a dataset for the commonsense question answering task. It consists of 12,247 questions with 5 answer choices each, generated by Amazon Mechanical Turk workers.
334 PAPERS • 1 BENCHMARK
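For concreteness, here is a minimal sketch of how the 5-choice format above might be inspected, assuming the dataset is mirrored on the Hugging Face Hub under the `commonsense_qa` identifier with `question`, `choices`, and `answerKey` fields (the hub id and field names are assumptions, not part of the description above):

```python
# Minimal sketch: load and inspect the 5-choice CommonsenseQA format.
# Assumes the Hugging Face Hub id "commonsense_qa" and a
# question / choices / answerKey schema; adjust if the hosted copy differs.
from datasets import load_dataset

dataset = load_dataset("commonsense_qa", split="validation")
example = dataset[0]

print(example["question"])
for label, text in zip(example["choices"]["label"], example["choices"]["text"]):
    marker = "*" if label == example["answerKey"] else " "
    print(f"{marker} {label}. {text}")
```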
WinoGrande is a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations.
331 PAPERS • 1 BENCHMARK
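The AfLite step is only named above; as a rough illustration of the embedding-association idea, the sketch below trains a shallow probe on precomputed instance embeddings and drops instances whose labels it predicts too reliably. The probe choice, number of rounds, and threshold are assumptions for illustration, not the published algorithm:

```python
# Simplified sketch in the spirit of AfLite: repeatedly train a shallow
# probe on instance embeddings and discard instances whose labels the
# probe predicts too reliably (i.e., likely artefact-driven instances).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit

def filter_predictable(embeddings, labels, threshold=0.75, n_rounds=10):
    """Return indices of instances that are NOT easily predictable from embeddings."""
    embeddings = np.asarray(embeddings)
    labels = np.asarray(labels)
    hits = np.zeros(len(labels))
    seen = np.zeros(len(labels))
    splitter = StratifiedShuffleSplit(n_splits=n_rounds, test_size=0.2, random_state=0)
    for train_idx, test_idx in splitter.split(embeddings, labels):
        probe = LogisticRegression(max_iter=1000)
        probe.fit(embeddings[train_idx], labels[train_idx])
        hits[test_idx] += probe.predict(embeddings[test_idx]) == labels[test_idx]
        seen[test_idx] += 1
    predictability = np.divide(hits, seen, out=np.zeros_like(hits), where=seen > 0)
    return np.where(predictability < threshold)[0]
```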
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. BIG-bench includes more than 200 tasks.
222 PAPERS • 134 BENCHMARKS
The Visual Dialog (VisDial) dataset contains human-annotated questions based on images from the MS COCO dataset. The dataset was developed by pairing two subjects on Amazon Mechanical Turk to chat about an image. One person was assigned the role of ‘questioner’ and the other acted as ‘answerer’. The questioner sees only a text description of an image (i.e., an image caption from MS COCO); the original image remains hidden. The questioner's task is to ask questions about this hidden image to “imagine the scene better”. The answerer sees the image and caption and answers the questions asked by the questioner. The two can continue the conversation by asking and answering questions for at most 10 rounds.
144 PAPERS • 6 BENCHMARKS
Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate what might come next ("then, she examined the engine"). SWAG (Situations With Adversarial Generations) is a large-scale dataset for this task of grounded commonsense inference, unifying natural language inference and physically grounded reasoning.
141 PAPERS • 2 BENCHMARKS
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a large-scale reading comprehension dataset which requires commonsense reasoning. ReCoRD consists of queries automatically generated from CNN/Daily Mail news articles; the answer to each query is a text span from a summarizing passage of the corresponding news article. The goal of ReCoRD is to evaluate a machine's ability to apply commonsense reasoning in reading comprehension. ReCoRD is pronounced [ˈrɛkərd].
99 PAPERS • 1 BENCHMARK
The AI2 Reasoning Challenge (ARC) dataset is a multiple-choice question-answering dataset containing questions from science exams for grades 3 through 9. The dataset is split into two partitions, Easy and Challenge, where the latter contains the more difficult questions that require reasoning. Most of the questions have 4 answer choices, with <1% of all questions having either 3 or 5 answer choices. ARC includes a supporting KB of 14.3M unstructured text passages.
86 PAPERS • 3 BENCHMARKS
Common Sense Explanations (CoS-E) consists of human explanations for commonsense reasoning in the form of natural language sequences and highlighted annotations.
43 PAPERS • NO BENCHMARKS YET
Benchmark for physical reasoning that contains a set of simple classical mechanics puzzles in a 2D physical environment. The benchmark is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles.
31 PAPERS • 2 BENCHMARKS
A corpus useful for two applications: automatic readability assessment and automatic text simplification. It consists of 189 texts, each in three versions (567 in total).
27 PAPERS • NO BENCHMARKS YET
The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of SWAG. As opposed to other automatically generated NLI datasets, CODAH is adversarially constructed by humans who can view feedback from a pre-trained model and use this information to design challenging commonsense questions. It contains 2801 questions in total, and uses 5-fold cross validation for evaluation.
26 PAPERS • 2 BENCHMARKS
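Because CODAH is evaluated with 5-fold cross validation rather than a fixed test split, a scoring loop might look roughly like the sketch below; `train_and_score` is a hypothetical callable standing in for whichever model is being evaluated:

```python
# Sketch of a 5-fold cross-validation protocol as used for CODAH.
# `train_and_score(train_fold, test_fold)` is a hypothetical callable that
# trains a model on one fold split and returns accuracy on the held-out fold.
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(questions, train_and_score, n_splits=5):
    accuracies = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(questions):
        train_fold = [questions[i] for i in train_idx]
        test_fold = [questions[i] for i in test_idx]
        accuracies.append(train_and_score(train_fold, test_fold))
    return float(np.mean(accuracies))
```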
MCScript was used as the official dataset of SemEval-2018 Task 11. It is a collection of text passages about daily-life activities and a series of questions referring to each passage, where each question is paired with two answer choices. MCScript comprises 9,731, 1,411, and 2,797 questions in the training, development, and test sets, respectively.
24 PAPERS • NO BENCHMARKS YET
A testbed for commonsense reasoning about entity knowledge, bridging fact-checking about entities with commonsense inferences.
23 PAPERS • NO BENCHMARKS YET
Current visual question answering (VQA) tasks mainly consider answering human-annotated questions for natural images in the daily-life context. Icon question answering (IconQA) is a benchmark which aims to highlight the importance of abstract diagram understanding and comprehensive cognitive reasoning in real-world diagram word problems. For this benchmark, a large-scale IconQA dataset is built that consists of three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. Compared to existing VQA benchmarks, IconQA requires not only perception skills like object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning.
23 PAPERS • 1 BENCHMARK
RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.
Event2Mind is a corpus of 25,000 event phrases covering a diverse range of everyday events and situations.
19 PAPERS • 2 BENCHMARKS
Moral Stories is a crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations, and their respective consequences.
19 PAPERS • NO BENCHMARKS YET
Question: I have five fingers but I am not alive. What am I? Answer: a glove.
17 PAPERS • 1 BENCHMARK
XStoryCloze consists of a professionally translated version of the English StoryCloze dataset (Spring 2016 version) in 10 non-English languages. It is intended for evaluating the zero- and few-shot learning capabilities of multilingual language models. The dataset is released by Meta AI.
11 PAPERS • NO BENCHMARKS YET
Fig-QA consists of 10256 examples of human-written creative metaphors that are paired as a Winograd schema. It can be used to evaluate the commonsense reasoning of models. The metaphors themselves can also be used as training data for other tasks, such as metaphor detection or generation.
10 PAPERS • NO BENCHMARKS YET
TimeDial is a crowdsourced English challenge set for temporal commonsense reasoning, formulated as a multiple-choice cloze task with around 1.5k carefully curated dialogs. The dataset is derived from DailyDialog, a multi-turn dialog corpus.
X-CSQA is a multilingual dataset for commonsense reasoning research, based on CSQA.
CC-Stories (or STORIES) is a dataset for commonsense reasoning and language modeling. It was constructed by aggregating documents from the CommonCrawl dataset that have the most overlapping n-grams with the questions in commonsense reasoning tasks. The top 1.0% of highest-ranked documents are chosen as the new training corpus.
9 PAPERS • NO BENCHMARKS YET
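A rough sketch of the selection procedure described above, ranking CommonCrawl documents by n-gram overlap with task questions and keeping the top 1%; the tokenization, n-gram order, and scoring function here are illustrative assumptions rather than the published recipe:

```python
# Illustrative document ranking by n-gram overlap with a question set,
# keeping the top 1% of documents, in the spirit of CC-Stories.
from collections import Counter

def ngrams(text, n=3):
    tokens = text.lower().split()
    return Counter(zip(*(tokens[i:] for i in range(n))))

def select_top_documents(documents, questions, keep_fraction=0.01, n=3):
    question_ngrams = Counter()
    for q in questions:
        question_ngrams.update(ngrams(q, n))
    # Score each document by how many of its n-grams also appear in the questions.
    scores = [
        sum(count for gram, count in ngrams(doc, n).items() if gram in question_ngrams)
        for doc in documents
    ]
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    keep = max(1, int(len(documents) * keep_fraction))
    return [documents[i] for i in ranked[:keep]]
```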
ProtoQA is a question-answering dataset for training and evaluating the commonsense reasoning capabilities of artificial intelligence systems in prototypical situations. The training set is gathered from an existing set of questions played on the long-running international game show Family Feud. The hidden evaluation set is created by gathering answers for each question from 100 crowd workers.
Rainbow is a multi-task benchmark for commonsense reasoning that draws on several existing QA datasets: aNLI, Cosmos QA, HellaSWAG, Physical IQa, Social IQa, and WinoGrande.
Complementary Commonsense (Com2Sense) is a dataset for benchmarking the commonsense reasoning ability of NLP models. It contains 4k true/false statement pairs. The dataset is crowdsourced and enhanced with an adversarial model-in-the-loop setup to incentivize challenging samples. To facilitate a systematic analysis of commonsense capabilities, the dataset is designed along the dimensions of knowledge domains, reasoning scenarios, and numeracy.
8 PAPERS • NO BENCHMARKS YET
Housekeep is a benchmark for evaluating commonsense reasoning in the home for embodied AI. In Housekeep, an embodied agent must tidy a house by rearranging misplaced objects without explicit instructions specifying which objects need to be rearranged. The dataset captures where humans typically place objects in tidy and untidy houses, covering 1,799 objects, 268 object categories, 585 placements, and 105 rooms.
A Benchmark for Robust Multi-Hop Spatial Reasoning in Texts
8 PAPERS • 1 BENCHMARK
Choice of Plausible Alternatives for Russian language (PARus) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. Each question in PARus is composed of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise. The correct alternative is randomized so that the expected performance of randomly guessing is 50%.
7 PAPERS • 1 BENCHMARK
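The premise-plus-two-alternatives format lends itself to a simple decision rule; in the sketch below, `plausibility` is a placeholder for whatever model score a researcher plugs in (it is not part of the dataset release):

```python
# Sketch of the PARus/COPA-style decision rule: pick the alternative whose
# causal link with the premise a model scores as more plausible.
# `plausibility` is a placeholder for an actual model score.
def choose_alternative(premise, alternative_1, alternative_2, question, plausibility):
    score_1 = plausibility(premise, alternative_1, question)  # question is "cause" or "effect"
    score_2 = plausibility(premise, alternative_2, question)
    return 1 if score_1 >= score_2 else 2

# Random guessing picks each alternative with probability 0.5,
# so the expected accuracy of a trivial baseline is 50%.
```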
A Winograd schema is a pair of sentences that differ in only one or two words and that contain an ambiguity that is resolved in opposite ways in the two sentences and requires the use of world knowledge and reasoning for its resolution. The schema takes its name from a well-known example by Terry Winograd.
CITE is a crowd-sourced resource for multimodal discourse: this resource characterises inferences in image-text contexts in the domain of cooking recipes in the form of coherence relations.
6 PAPERS • 1 BENCHMARK
OMICS is an extensive collection of knowledge for indoor service robots gathered from internet users. Currently, it contains 48 tables capturing different sorts of knowledge. Each tuple of the Help table maps a user desire to a task that may meet the desire (e.g., ⟨ “feel thirsty”, “by offering drink” ⟩). Each tuple of the Tasks/Steps table decomposes a task into several steps (e.g., ⟨ “serve a drink”, 0. “get a glass”, 1. “get a bottle”, 2. “fill glass from bottle”, 3. “give glass to person” ⟩). Given this, OMICS offers useful knowledge about the hierarchical structure of naturalistic instructions, where a high-level user request (e.g., “serve a drink”) can be reduced to lower-level tasks (e.g., “get a glass”, ⋯). Another feature of OMICS is that the elements of any tuple in an OMICS table are semantically related according to a predefined template, which facilitates the semantic interpretation of the OMICS tuples.
6 PAPERS • NO BENCHMARKS YET
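The tuple structure of the Help and Tasks/Steps tables can be mirrored with plain mappings; the toy example below uses only the entries quoted above:

```python
# Minimal mirror of the two OMICS tables quoted above: Help maps a user
# desire to a task that may satisfy it, and Tasks/Steps decomposes a task
# into ordered lower-level steps.
help_table = {
    "feel thirsty": "by offering drink",
}
tasks_steps_table = {
    "serve a drink": ["get a glass", "get a bottle",
                      "fill glass from bottle", "give glass to person"],
}

# A high-level request is reduced to its lower-level steps by table lookup.
for i, step in enumerate(tasks_steps_table["serve a drink"]):
    print(i, step)
```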
Contains 13.6k masked-word-prediction probes, 10.5k for fine-tuning and 3.1k for testing.
5 PAPERS • NO BENCHMARKS YET
Russian Reading Comprehension with Commonsense Reasoning (RuCoS) is a large-scale reading comprehension dataset that requires commonsense reasoning. RuCoS consists of queries automatically generated from CNN/Daily Mail news articles; the answer to each query is a text span from a summarizing passage of the corresponding news article. The goal of RuCoS is to evaluate a machine's ability to apply commonsense reasoning in reading comprehension.
5 PAPERS • 1 BENCHMARK
The Sarcasm Corpus contains sarcastic and non-sarcastic utterances of three different types, balanced so that half of the samples are sarcastic and half non-sarcastic.
WinoGAViL is a dataset of challenging vision-and-language associations collected via the WinoGAViL game. Inspired by the popular card game Codenames, a “spymaster” gives a textual cue related to several visual candidates, and another player has to identify them.
4 PAPERS • 2 BENCHMARKS
We provide the BCOPA-CE test set, which has a balanced token distribution in the correct and wrong alternatives and increases the difficulty of distinguishing cause from effect.
3 PAPERS • NO BENCHMARKS YET
PACS (Physical Audiovisual CommonSense) is the first audiovisual benchmark annotated for physical commonsense attributes. PACS contains a total of 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos. The dataset provides new opportunities to advance the research field of physical reasoning by bringing audio as a core component of this multimodal problem.
3 PAPERS • 1 BENCHMARK
CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK.
2 PAPERS • 1 BENCHMARK
General-purpose Visual Understanding Evaluation (G-VUE) is a comprehensive benchmark covering the full spectrum of visual cognitive abilities across four functional domains: Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation.
2 PAPERS • NO BENCHMARKS YET
Consists of visual arithmetic problems automatically generated using a grammar model, the And-Or Graph (AOG). These visual arithmetic problems take the form of geometric figures: each problem has a set of geometric shapes as its context, with embedded number symbols.
We generate epistemic reasoning problems using modal logic to target theory of mind (ToM) in natural language processing models.
Recent times have witnessed an increasing number of applications of deep neural networks to tasks that require superior cognitive abilities, e.g., playing Go, generating art, ChatGPT, etc. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task (and the associated SMART-101 dataset) for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for younger children (ages 6–8). Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and their solution needs a mix of several elementary skills, including pattern recognition, algebra, and spatial reasoning, among others. To train deep neural networks, we programmatically augment each puzzle to 2,000 new instances, each varying in appearance.
WHOOPS! is a dataset and benchmark for visual commonsense. The dataset comprises purposefully commonsense-defying images created by designers using publicly available image-generation tools like Midjourney. It contains images that defy commonsense for a wide range of reasons, including deviations from expected social norms and everyday knowledge.
2 PAPERS • 4 BENCHMARKS
The Advice-Seeking Questions (ASQ) dataset is a collection of personal narratives with advice-seeking questions. The dataset is split into train, test, and heldout sets with 8,865, 2,500, and 10,000 instances, respectively. It is used to train and evaluate methods that infer the advice-seeking goal behind a personal narrative. The task is formulated as a cloze test, where the goal is to identify which of two advice-seeking questions was removed from a given narrative.
1 PAPER • NO BENCHMARKS YET
CriticBench is a comprehensive benchmark designed to assess the abilities of Large Language Models (LLMs) to critique and rectify their reasoning across various tasks. It encompasses five reasoning domains.
DiscoSense is a benchmark sourced from datasets in which two sentences are connected through a discourse connective. Specifically, it is sourced from two peer-reviewed academic datasets, DISCOVERY and DISCOFUSE, and targets commonsense reasoning via understanding a wide variety of discourse connectives.
DpgMedia2019 is a Dutch news dataset for partisanship detection. It contains more than 100K articles that are labelled on the publisher level and 776 articles that were crowdsourced using an internal survey platform and labelled on the article level.