Visual Commonsense Reasoning (VCR) is a large-scale dataset for cognition-level visual understanding. Given a challenging question about an image, a machine must perform two sub-tasks: answer the question correctly and provide a rationale justifying that answer. The VCR dataset contains over 212K (training), 26K (validation) and 25K (testing) questions, answers and rationales derived from 110K movie scenes.
158 PAPERS • 13 BENCHMARKS
e-SNLI-VE is a large vision-language (VL) dataset with natural language explanations (NLEs), containing over 430k instances for which the explanations rely on the image content. It was built by merging the explanations from e-SNLI with the image-sentence pairs from SNLI-VE.
15 PAPERS • 2 BENCHMARKS
The ability to recognize analogies is fundamental to human cognition. Existing benchmarks that test word analogy do not reveal the underlying process of analogical reasoning in neural models.
8 PAPERS • NO BENCHMARKS YET
CLEVR-X is a dataset that extends the CLEVR dataset with natural language explanations in the context of VQA. It consists of 3.6 million natural language explanations for 850k question-image pairs.
4 PAPERS • 1 BENCHMARK
Can language models read biomedical texts and explain the biomedical mechanisms discussed? This work introduces a biomedical mechanism summarization task. Biomedical studies often investigate the mechanisms by which one entity (e.g., a protein or a chemical) affects another in a biological context. The abstracts of these publications often include a focused set of sentences presenting relevant supporting statements about such relationships and the associated experimental evidence, followed by a concluding sentence that summarizes the mechanism underlying the relationship. This structure is leveraged to create a summarization task in which the input is a collection of sentences and the main entities in an abstract, and the output includes the relationship and a sentence summarizing the mechanism. Using a small amount of manually labeled mechanism sentences, a mechanism sentence classifier is trained to filter a large biomedical abstract collection and create a summarization dataset.
2 PAPERS • NO BENCHMARKS YET
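As an illustrative sketch of the mechanism summarization task's input/output structure, one could model an instance as below. All field names and the example values are hypothetical, not the dataset's actual format:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema sketching the task structure described above;
# field and value names are illustrative, not taken from the dataset.
@dataclass
class MechanismInstance:
    sentences: List[str]              # input: supporting statements from the abstract
    entities: List[str]               # input: main entities (e.g., a protein, a chemical)
    relationship: str = ""            # output: how one entity affects the other
    mechanism: str = ""               # output: sentence summarizing the mechanism

# A model would read `sentences` and `entities` and generate
# `relationship` and `mechanism`.
example = MechanismInstance(
    sentences=["Compound X reduced expression of protein Y in cell line Z."],
    entities=["Compound X", "Protein Y"],
)
```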
WHOOPS! is a dataset and benchmark for visual commonsense. It is comprised of purposefully commonsense-defying images created by designers using publicly available image generation tools such as Midjourney. The images defy commonsense for a wide range of reasons, from deviations from expected social norms to violations of everyday knowledge.
2 PAPERS • 4 BENCHMARKS
E-ReDial is a conversational recommender system dataset with high-quality explanations. It consists of 756 dialogues comprising 12,003 utterances, with 15.9 turns per dialogue on average. It includes 2,058 high-quality explanations, each 79.2 tokens long on average.
1 PAPER • NO BENCHMARKS YET
ExPUNations is a humor dataset with extensive and fine-grained annotations specifically for puns. It is designed for two new tasks: explanation generation to aid pun classification, and keyword-conditioned pun generation.
Reasoning over spans of tokens from different parts of the input is essential for natural language understanding (NLU) tasks such as fact-checking (FC), machine reading comprehension (MRC), and natural language inference (NLI). SpanEx is a multi-annotator dataset of human-annotated span-interaction explanations for two NLU tasks: NLI and FC.