The RefCOCO dataset is a referring expression generation (REG) dataset used for tasks related to understanding natural language expressions that refer to specific objects in images.
301 PAPERS • 19 BENCHMARKS
SHAPES is a dataset of synthetic images designed to benchmark systems for understanding spatial and logical relations among multiple objects. The dataset consists of complex questions about arrangements of colored shapes. The questions are built around compositions of concepts and relations, e.g. "Is there a red shape above a circle?" or "Is a red shape blue?". Questions contain between two and four attributes, object types, or relationships. There are 244 questions and 15,616 images in total, and every question has both a yes and a no answer (with a corresponding supporting image), which eliminates the risk of learning answer biases.
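Questions of this kind decompose naturally into simple filter/relate/exists operations over the scene. Below is a minimal, hypothetical sketch of that decomposition on a toy scene; the scene format and helper names are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch: answering a SHAPES-style compositional question
# over a toy scene. Row 0 is assumed to be the top of the grid.

scene = [
    {"shape": "circle",   "color": "green", "row": 2, "col": 1},
    {"shape": "square",   "color": "red",   "row": 0, "col": 1},
    {"shape": "triangle", "color": "blue",  "row": 1, "col": 2},
]

def filter_color(objs, color):
    return [o for o in objs if o["color"] == color]

def filter_shape(objs, shape):
    return [o for o in objs if o["shape"] == shape]

def above(candidates, references):
    # An object counts as "above" a reference if it sits in a lower row index.
    return [c for c in candidates
            if any(c["row"] < r["row"] for r in references)]

# "Is there a red shape above a circle?"
answer = len(above(filter_color(scene, "red"), filter_shape(scene, "circle"))) > 0
print("yes" if answer else "no")  # -> "yes"
```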
111 PAPERS • 1 BENCHMARK
Visual Entailment (VE) consists of image-sentence pairs in which the premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal of a trained VE model is to predict whether the image semantically entails the text. SNLI-VE is a dataset for VE built from the Stanford Natural Language Inference corpus and the Flickr30k dataset.
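As in textual entailment, each premise-hypothesis pair carries one of three labels (entailment, neutral, contradiction); only the premise modality changes. Below is a minimal sketch of the record structure and an accuracy computation; the field names are assumptions for illustration, not the official release schema.

```python
# Illustrative SNLI-VE-style records: an image premise, a text hypothesis,
# and one of three gold labels.
examples = [
    {"image": "flickr30k_0001.jpg", "hypothesis": "Two dogs play in the snow.",      "label": "entailment"},
    {"image": "flickr30k_0002.jpg", "hypothesis": "A man is sleeping indoors.",      "label": "contradiction"},
    {"image": "flickr30k_0003.jpg", "hypothesis": "The woman is on her way to work.", "label": "neutral"},
]

LABELS = ("entailment", "neutral", "contradiction")

def evaluate(predict, examples):
    """predict(image_path, hypothesis) -> one of LABELS."""
    correct = sum(predict(ex["image"], ex["hypothesis"]) == ex["label"] for ex in examples)
    return correct / len(examples)

# A trivial majority-class baseline, purely for illustration.
accuracy = evaluate(lambda img, hyp: "entailment", examples)
print(f"accuracy: {accuracy:.2f}")
```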
109 PAPERS • 2 BENCHMARKS
RAVEN consists of 1,120,000 images and 70,000 RPM (Raven's Progressive Matrices) problems, equally distributed in 7 distinct figure configurations.
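Each RPM problem presents the first eight panels of a 3x3 matrix and asks the model to pick the missing ninth panel from a set of answer candidates. A minimal sketch of that problem structure follows; the field names are assumptions, and the released data ships panels as image arrays rather than file names.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RPMProblem:
    """Sketch of a Raven's Progressive Matrices item: eight context panels,
    a pool of candidate completions, and the index of the correct one."""
    context_panels: List[str]     # the 8 given panels of the 3x3 matrix
    answer_candidates: List[str]  # candidate completions for the missing panel
    target_index: int             # index of the correct candidate

def is_correct(problem: RPMProblem, predicted_index: int) -> bool:
    return predicted_index == problem.target_index

problem = RPMProblem(
    context_panels=[f"panel_{i}.png" for i in range(8)],
    answer_candidates=[f"candidate_{i}.png" for i in range(8)],
    target_index=3,
)
print(is_correct(problem, 3))  # True
```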
81 PAPERS • NO BENCHMARKS YET
NLVR contains 92,244 pairs of human-written English sentences grounded in synthetic images. Because the images are synthetically generated, this dataset can be used for semantic parsing.
76 PAPERS • 3 BENCHMARKS
AbstractReasoning is a dataset for abstract reasoning, where the goal is to infer the correct answer panel from the given context panels.
74 PAPERS • NO BENCHMARKS YET
The dataset contains 145k captions for 28k images. It challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects.
64 PAPERS • 1 BENCHMARK
Winoground is a dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning. Given two images and two captions, the goal is to match them correctly -- but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance.
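Performance on Winoground is commonly reported with three metrics: a text score (for each image, the matching caption must be ranked higher), an image score (for each caption, the matching image must be ranked higher), and a group score (both conditions at once). Below is a minimal sketch of those checks, assuming a generic image-text similarity function in place of a real model; the toy similarity table is purely illustrative.

```python
def winoground_scores(sim, c0, c1, i0, i1):
    """Text/image/group checks for one example, where caption c0 matches
    image i0 and caption c1 matches image i1. `sim(caption, image)` is any
    image-text matching score (e.g. a CLIP-style similarity)."""
    text_ok = sim(c0, i0) > sim(c1, i0) and sim(c1, i1) > sim(c0, i1)
    image_ok = sim(c0, i0) > sim(c0, i1) and sim(c1, i1) > sim(c1, i0)
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}

# Toy similarity table for illustration only.
table = {("an old person kisses a young person", "img_a"): 0.9,
         ("a young person kisses an old person", "img_a"): 0.4,
         ("an old person kisses a young person", "img_b"): 0.3,
         ("a young person kisses an old person", "img_b"): 0.8}
sim = lambda c, i: table[(c, i)]
print(winoground_scores(sim,
                        "an old person kisses a young person",
                        "a young person kisses an old person",
                        "img_a", "img_b"))
# {'text': True, 'image': True, 'group': True}
```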
57 PAPERS • 1 BENCHMARK
The PGM dataset serves as a tool for studying both abstract reasoning and generalisation in models. Generalisation is a multi-faceted phenomenon; there is no single, objective way in which models can or should generalise beyond their experience. PGM provides a means to measure the generalisation ability of models in different ways, each of which may be more or less interesting to researchers depending on their intended training setup and applications.
49 PAPERS • NO BENCHMARKS YET
This dataset is rendered synthetically using a library of standard 3D objects, and it tests the ability to recognize compositions of object movements that require long-term reasoning.
47 PAPERS • 3 BENCHMARKS
The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation of two individual objects in the image, and a vision-language model (VLM) needs to judge whether the caption is correctly describing the image (True) or not (False).
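Because each example is a binary judgement, evaluation reduces to accuracy over the true/false labels. A minimal sketch follows; the record fields are assumptions for illustration, not the official release schema.

```python
# Minimal sketch of VSR-style evaluation: each record pairs an image with a
# caption describing a spatial relation between two objects, plus a boolean label.
records = [
    {"image": "000001.jpg", "caption": "The cat is on top of the sofa.", "label": True},
    {"image": "000002.jpg", "caption": "The bottle is under the table.", "label": False},
]

def accuracy(predict, records):
    """predict(image, caption) -> bool (True if the caption fits the image)."""
    correct = sum(predict(r["image"], r["caption"]) == r["label"] for r in records)
    return correct / len(records)

print(accuracy(lambda img, cap: True, records))  # always-True baseline -> 0.5
```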
41 PAPERS • 1 BENCHMARK
FigureQA is a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts.
38 PAPERS • 1 BENCHMARK
This benchmark for physical reasoning contains a set of simple classical mechanics puzzles in a 2D physical environment. It is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles.
31 PAPERS • 2 BENCHMARKS
MathVista is a consolidated benchmark for mathematical reasoning in visual contexts. It consists of three newly created datasets, IQTest, FunctionQA, and PaperQA, which address missing visual domains and are tailored to evaluate logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively. It also incorporates 9 MathQA datasets and 19 VQA datasets from the literature, which significantly enrich the diversity and complexity of visual perception and mathematical reasoning challenges within the benchmark. In total, MathVista includes 6,141 examples collected from 31 different datasets.
26 PAPERS • NO BENCHMARKS YET
Current visual question answering (VQA) tasks mainly consider answering human-annotated questions for natural images in the daily-life context. Icon question answering (IconQA) is a benchmark which aims to highlight the importance of abstract diagram understanding and comprehensive cognitive reasoning in real-world diagram word problems. For this benchmark, a large-scale IconQA dataset is built that consists of three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. Compared to existing VQA benchmarks, IconQA requires not only perception skills like object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning.
23 PAPERS • 1 BENCHMARK
The Image-Grounded Language Understanding Evaluation (IGLUE) benchmark brings together—by both aggregating pre-existing datasets and creating new ones—visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. The benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.
21 PAPERS • 13 BENCHMARKS
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Although many benchmarks attempt to holistically evaluate MLLMs, they typically concentrate on basic reasoning tasks, often yielding only simple yes/no or multiple-choice responses. These methods naturally lead to confusion and difficulties in conclusively determining the reasoning capabilities of MLLMs. To mitigate this issue, we manually curate the CORE-MM benchmark dataset, specifically designed for MLLMs with a focus on complex reasoning tasks. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. The queries in our dataset are intentionally constructed to engage the reasoning capabilities of MLLMs in the process of generating answers. For a fair comparison across various MLLMs, we incorporate intermediate reasoning steps into our evaluation criteria. The CORE-MM benchmark consists of 279 manually curated reasoning questions.
19 PAPERS • 1 BENCHMARK
Multicultural Reasoning over Vision and Language (MaRVL) is a dataset based on an ImageNet-style hierarchy representative of many languages and cultures (Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish). The selection of both concepts and images is entirely driven by native speakers. Afterwards, we elicit statements from native speakers about pairs of images. The task consists in discriminating whether each grounded statement is true or false.
18 PAPERS • 3 BENCHMARKS
Social-IQ is an unconstrained benchmark specifically designed to train and evaluate socially intelligent technologies. By providing a rich source of open-ended questions and answers, Social-IQ opens the door to explainable social intelligence. The dataset contains rigorously annotated and validated videos, questions and answers, as well as annotations for the complexity level of each question and answer. Social-IQ contains 1,250 natural in-the-wild social situations, 7,500 questions and 52,500 correct and incorrect answers.
18 PAPERS • NO BENCHMARKS YET
CLEVR-Ref+ is a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators.
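Because each referring expression is tied to a modular functional program, the program can be executed step by step against the scene, yielding intermediate ground truth without human annotation. A hypothetical sketch of such an execution follows; the module names and scene fields are illustrative, not the dataset's exact vocabulary.

```python
# Hypothetical sketch: executing a CLEVR-Ref+-style functional program
# ("the red object left of the cube") against a toy scene graph.
scene = [
    {"id": 0, "shape": "cube",     "color": "gray", "x": 5.0},
    {"id": 1, "shape": "sphere",   "color": "red",  "x": 2.0},
    {"id": 2, "shape": "cylinder", "color": "red",  "x": 7.0},
]

def filter_color(objs, color):
    return [o for o in objs if o["color"] == color]

def filter_shape(objs, shape):
    return [o for o in objs if o["shape"] == shape]

def relate_left(objs, anchors):
    return [o for o in objs if any(o["x"] < a["x"] for a in anchors)]

# Program: filter_shape[cube] -> relate[left] -> filter_color[red]
step1 = filter_shape(scene, "cube")   # the cube
step2 = relate_left(scene, step1)     # everything left of it
step3 = filter_color(step2, "red")    # keep the red ones
print([o["id"] for o in step3])       # -> [1], the referred object
```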
16 PAPERS • 2 BENCHMARKS
COG is a configurable visual question-and-answer dataset designed to parallel experiments in humans and animals. It is much simpler than the general problem of video analysis, yet it addresses many of the problems relating to visual and logical reasoning and memory -- problems that remain challenging for modern deep learning architectures.
7 PAPERS • NO BENCHMARKS YET
Cops-Ref is a dataset for visual reasoning in the context of referring expression comprehension with two main features.
The Relative Size dataset contains 486 object pairs between 41 physical objects. Size comparisons are not available for all pairs of objects (e.g. bird and watermelon) because for some pairs humans cannot determine which object is bigger.
6 PAPERS • NO BENCHMARKS YET
Compositional Physical Reasoning is a dataset for understanding object-centric and relational physics properties hidden from visual appearances. For a given set of objects, the dataset includes a few videos of them moving and interacting under different initial conditions. The model is evaluated based on its capability to unravel the compositional hidden properties, such as mass and charge, and use this knowledge to answer a set of questions posed about one of the videos.
5 PAPERS • NO BENCHMARKS YET
TRANCE extends CLEVR by asking a uniform question, i.e., what is the transformation between two given images, to test the ability of transformation reasoning. TRANCE includes three levels of settings: Basic (single-step transformation), Event (multi-step transformation), and View (multi-step transformation with varying views). Detailed information can be found at https://hongxin2019.github.io/TVR.
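One convenient way to think about the task is as predicting the sequence of atomic attribute changes that turns the initial scene into the final one. The sketch below is a hypothetical illustration of that representation; the attribute names and tuple format are assumptions, not the dataset's exact annotation scheme.

```python
# Hypothetical sketch of a TRANCE-style answer: a transformation is a
# sequence of atomic attribute changes applied to objects in the initial scene.
initial = {0: {"color": "red",  "size": "small"},
           1: {"color": "blue", "size": "large"}}

# A multi-step (Event-level) transformation: each step is
# (object id, attribute, new value).
transformation = [(0, "color", "green"), (1, "size", "small")]

def apply(scene, steps):
    scene = {k: dict(v) for k, v in scene.items()}
    for obj_id, attr, value in steps:
        scene[obj_id][attr] = value
    return scene

final = apply(initial, transformation)
print(final[0]["color"], final[1]["size"])  # -> green small
```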
PGDP5K is a dataset of 5,000 diagram samples composed of 16 shapes, covering 5 positional relations, 22 symbol types, and 6 text types, labeled with fine-grained annotations at the primitive level, including primitive classes, locations, and relationships. Of these, 1,813 non-duplicated images are selected from the Geometry3K dataset, and the other 3,187 images are collected by taking screenshots from PDF versions of three popular textbooks, spanning grades 6-12, found on mathematics curriculum websites.
4 PAPERS • 1 BENCHMARK
This dataset was collected via the WinoGAViL game, which gathers challenging vision-and-language associations. In a setup inspired by the popular card game Codenames, a “spymaster” gives a textual cue related to several visual candidates, and another player has to identify them.
4 PAPERS • 2 BENCHMARKS
ADE-Affordance is a dataset that builds upon ADE20k, adding annotations that enable rich visual reasoning.
3 PAPERS • NO BENCHMARKS YET
KiloGram is a resource for studying abstract visual reasoning in humans and machines. It contains a richly annotated dataset with >1k distinct stimuli.
General-purpose Visual Understanding Evaluation (G-VUE) is a comprehensive benchmark covering the full spectrum of visual cognitive abilities with four functional domains -- Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation.
2 PAPERS • NO BENCHMARKS YET
This dataset consists of visual arithmetic problems automatically generated using a grammar model, the And-Or Graph (AOG). These visual arithmetic problems take the form of geometric figures: each problem has a set of geometric shapes as its context and embedded number symbols.
Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, ChatGPT, etc. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task (and the associated SMART-101 dataset) for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for younger children (ages 6-8). Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and their solution needs a mix of several elementary skills, including pattern recognition, algebra, and spatial reasoning, among others. To train deep neural networks, we programmatically augment each puzzle to 2,000 new instances, each varying in appearance.
Super-CLEVR is a dataset for Visual Question Answering (VQA) in which different factors in VQA domain shifts can be isolated so that their effects can be studied independently. It contains 21 vehicle models belonging to 5 categories, with controllable attributes. Four factors are considered: visual complexity, question redundancy, concept distribution, and concept compositionality.
Bongard-OpenWorld is a new benchmark for evaluating real-world few-shot reasoning for machine vision. We hope it can help us better understand the limitations of current visual intelligence and facilitate future research on visual agents with stronger few-shot visual reasoning capabilities.
1 PAPER • 1 BENCHMARK
The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel tasks of multimodal figurative understanding and preference.
1 PAPER • 2 BENCHMARKS
The Synthetic Visual Reasoning Test (SVRT) is a series of 23 classification problems involving images of randomly generated shapes.
1 PAPER • NO BENCHMARKS YET
Sequence Consistency Evaluation (SCE) consists of a benchmark task for evaluating the consistency of sequences.
Visual Analogies of Situation Recognition (VASR) is a dataset for visual analogical mapping, adapting the classical word-analogy task into the visual domain. It contains 196K object transitions and 385K activity transitions. Experiments demonstrate that state-of-the-art models do well when distractors are chosen randomly (~86%), but struggle with carefully chosen distractors (~53%, compared to 90% human accuracy).
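The setup mirrors the classical word-analogy task A : A' :: B : ?, with images in place of words: given a source pair illustrating a transition and a target image, the model selects the candidate that completes the analogous transition. Below is a minimal sketch of one simple scoring heuristic, assuming a generic image encoder; this is illustrative only and is not the method proposed with the dataset.

```python
def solve_analogy(embed, a, a_prime, b, candidates):
    """Pick the candidate b' that best completes a : a' :: b : b'.

    `embed(image) -> list[float]` is an assumed image encoder; the arithmetic
    mirrors the classic word-analogy heuristic (a' - a + b ~ b').
    """
    def add(u, v):  return [x + y for x, y in zip(u, v)]
    def sub(u, v):  return [x - y for x, y in zip(u, v)]
    def dist(u, v): return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

    target = add(sub(embed(a_prime), embed(a)), embed(b))
    return min(candidates, key=lambda c: dist(embed(c), target))

# Toy 2-D embeddings for illustration only.
vectors = {"dog_sitting": [0.0, 0.0], "dog_running": [1.0, 0.0],
           "cat_sitting": [0.0, 1.0], "cat_running": [1.0, 1.0],
           "cat_sleeping": [-1.0, 1.0]}
embed = vectors.__getitem__
print(solve_analogy(embed, "dog_sitting", "dog_running",
                    "cat_sitting", ["cat_running", "cat_sleeping"]))
# -> "cat_running"
```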
Visual Choice of Plausible Alternatives (VCOPA) is an evaluation dataset containing 380 VCOPA questions and over 1K images covering various topics. The dataset is amenable to automatic evaluation, and the performance of baseline reasoning approaches is reported as an initial benchmark for future systems.
lilGym is a benchmark for language-conditioned reinforcement learning in visual environments, based on 2,661 highly compositional, human-written natural language statements grounded in an interactive visual environment. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty.
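Each statement-plus-start-state pairing behaves like a small episodic task: the agent manipulates the environment and is rewarded according to the statement's reward function. The sketch below uses a generic gym-style interface as a stand-in; it is not lilGym's actual API, and the statement, actions, and reward logic are toy illustrations.

```python
import random

class ToyStatementEnv:
    """Stand-in for a lilGym-style task: a natural language statement paired
    with a start state and a reward function. Generic gym-style sketch only."""
    def __init__(self, statement, horizon=10):
        self.statement = statement
        self.horizon = horizon
        self.steps = 0
        self.board = []

    def reset(self):
        self.steps = 0
        self.board = []
        return {"statement": self.statement, "board": list(self.board)}

    def step(self, action):
        self.steps += 1
        if action == "add_item":
            self.board.append("item")
        # Toy reward: success if the agent stops with a non-empty board; a real
        # reward function would check the statement against the visual state.
        done = action == "stop" or self.steps >= self.horizon
        reward = 1.0 if action == "stop" and self.board else 0.0
        obs = {"statement": self.statement, "board": list(self.board)}
        return obs, reward, done, {}

env = ToyStatementEnv("There is exactly one black triangle not touching any edge.")
obs, done, episode_return = env.reset(), False, 0.0
while not done:
    action = random.choice(["add_item", "stop"])  # random policy for illustration
    obs, reward, done, _ = env.step(action)
    episode_return += reward
print("episode return:", episode_return)
```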
A fundamental component of human vision is our ability to parse complex visual scenes and judge the relations between their constituent objects. AI benchmarks for visual reasoning have driven rapid progress in recent years, with state-of-the-art systems now reaching human accuracy on some of these benchmarks. Yet there remains a major gap between humans and AI systems in terms of the sample efficiency with which they learn new visual reasoning tasks. Humans' remarkable efficiency at learning has been at least partially attributed to their ability to harness compositionality -- allowing them to efficiently take advantage of previously gained knowledge when learning new tasks. Here, we introduce a novel visual reasoning benchmark, Compositional Visual Relations (CVR), to drive progress towards the development of more data-efficient learning algorithms. We take inspiration from fluid intelligence and non-verbal reasoning tests and describe a novel method for creating compositions of abstract relations.
0 PAPER • NO BENCHMARKS YET