The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
10,147 PAPERS • 92 BENCHMARKS
The GQA dataset is a large-scale visual question answering dataset with real images from the Visual Genome dataset and balanced question-answer pairs. Each training and validation image is also associated with scene graph annotations describing the classes and attributes of the objects in the scene and their pairwise relations. Along with the images and question-answer pairs, the GQA dataset provides two types of pre-extracted visual features for each image: convolutional grid features of size 7×7×2048 extracted from a ResNet-101 network trained on ImageNet, and object detection features of size Ndet×2048 (where Ndet is the number of detected objects in the image, capped at 100 per image) from a Faster R-CNN detector.
431 PAPERS • 5 BENCHMARKS
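The feature shapes described for GQA can be sanity-checked in a few lines. The sketch below is a minimal illustration only, assuming the grid and object features are stored in HDF5 files; the file paths and the "features" dataset key are hypothetical placeholders, not the official GQA release layout.

```python
# Minimal sketch: verify that GQA-style features match the documented shapes.
# File names and the "features" key are hypothetical placeholders.
import h5py
import numpy as np

GRID_SHAPE = (7, 7, 2048)   # ResNet-101 convolutional grid features
MAX_OBJECTS = 100           # Faster R-CNN detections are capped at 100 per image
FEATURE_DIM = 2048

def check_features(grid_path: str, objects_path: str, image_index: int) -> None:
    """Load one image's pre-extracted features and check their shapes."""
    with h5py.File(grid_path, "r") as f:
        grid = np.asarray(f["features"][image_index])      # hypothetical key
        assert grid.shape == GRID_SHAPE, grid.shape

    with h5py.File(objects_path, "r") as f:
        objects = np.asarray(f["features"][image_index])   # hypothetical key
        n_det, dim = objects.shape                          # (Ndet, 2048)
        assert n_det <= MAX_OBJECTS and dim == FEATURE_DIM

    print(f"grid: {grid.shape}, objects: {objects.shape}")
```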
Visual Question Answering (VQA) v2.0 is a dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. It is the second version of the VQA dataset.
342 PAPERS • 7 BENCHMARKS
TextVQA is a dataset to benchmark visual reasoning based on text in images. TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions.
201 PAPERS • 2 BENCHMARKS
The VizWiz-VQA dataset originates from a natural visual question answering setting in which blind people each took an image and recorded a spoken question about it, together with 10 crowdsourced answers per visual question. The associated challenge addresses two tasks for this dataset: (1) predict the answer to a visual question, and (2) predict whether a visual question cannot be answered.
140 PAPERS • 7 BENCHMARKS
MMBench is a multi-modality benchmark. It develops a comprehensive evaluation pipeline comprising two elements. The first is a meticulously curated dataset that surpasses existing similar benchmarks in the number and variety of evaluation questions and abilities. The second is a novel CircularEval strategy that incorporates ChatGPT to convert free-form predictions into pre-defined choices, enabling a more robust evaluation of the model's predictions.
97 PAPERS • 1 BENCHMARK
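The CircularEval strategy described for MMBench can be sketched as follows: an N-choice question is posed N times with its options circularly rotated, and the question counts as correct only if the model picks the right option in every rotation. In the sketch below, `ask_model` and `match_prediction_to_choice` are hypothetical stand-ins for the model under test and the ChatGPT-based choice extractor.

```python
# Minimal sketch of the CircularEval idea, assuming hypothetical callables
# for the model call and the free-form-to-choice matcher.
from typing import Callable, List

def circular_eval(question: str,
                  choices: List[str],
                  correct: str,
                  ask_model: Callable[[str, List[str]], str],
                  match_prediction_to_choice: Callable[[str, List[str]], str]) -> bool:
    n = len(choices)
    for shift in range(n):
        rotated = choices[shift:] + choices[:shift]        # circular shift of the options
        raw_prediction = ask_model(question, rotated)       # free-form model output
        picked = match_prediction_to_choice(raw_prediction, rotated)
        if picked != correct:
            return False                                    # one failed rotation fails the question
    return True
```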
MM-Vet is a benchmark for evaluating large multimodal models on integrated capabilities.
92 PAPERS • 1 BENCHMARK
The MSR-VTT-QA dataset is a benchmark for Visual Question Answering (VQA) on the MSR-VTT (Microsoft Research Video to Text) dataset. It is used to evaluate models on their ability to answer questions about these videos, and is one of several tasks the underlying dataset supports, along with Video Retrieval, Video Captioning, Zero-Shot Video Question Answering, Zero-Shot Video Retrieval, and Text-to-Video Generation.
55 PAPERS • 6 BENCHMARKS
The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset. It is based on the existing Microsoft Research Video Description (MSVD) dataset, which consists of about 120K sentences describing more than 2,000 video snippets. In the MSVD-QA dataset, Question-Answer (QA) pairs are generated from these descriptions. The underlying dataset is mainly used in video captioning experiments, but due to its large size it is also used for VideoQA. MSVD-QA contains 1,970 video clips and approximately 50.5K QA pairs.
49 PAPERS • 5 BENCHMARKS
PlotQA is a VQA dataset with 28.9 million question-answer pairs grounded over 224,377 plots on data from real-world sources, with questions based on crowd-sourced question templates. Existing synthetic datasets (FigureQA, DVQA) for reasoning over plots do not contain variability in data labels, real-valued data, or complex reasoning questions. Consequently, models proposed for these datasets do not fully address the challenge of reasoning over plots. In particular, they assume that the answer comes either from a small fixed-size vocabulary or from a bounding box within the image. In practice this is an unrealistic assumption, because many questions require reasoning and thus have real-valued answers that appear neither in a small fixed-size vocabulary nor in the image. PlotQA aims to bridge this gap between existing datasets and real-world plots; further, 80.76% of the out-of-vocabulary (OOV) questions in PlotQA have answers that are not in a fixed vocabulary.
27 PAPERS • 5 BENCHMARKS
MathVista is a consolidated benchmark for mathematical reasoning in visual contexts. It consists of three newly created datasets, IQTest, FunctionQA, and PaperQA, which address missing visual domains and are tailored to evaluate logical reasoning on puzzle-test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively. It also incorporates 9 MathQA datasets and 19 VQA datasets from the literature, which significantly enrich the diversity and complexity of the visual perception and mathematical reasoning challenges within the benchmark. In total, MathVista includes 6,141 examples collected from 31 different datasets.
26 PAPERS • NO BENCHMARKS YET
The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for measuring the performance and robustness of vision systems across multiple image prediction tasks, concepts, and data sources. GRIT aims to encourage the research community to pursue vision systems that generalize and remain robust across these dimensions.
13 PAPERS • 8 BENCHMARKS
ViP-Bench is a comprehensive benchmark designed to assess the capability of multimodal models in understanding visual prompts across multiple dimensions. It aims to evaluate how well these models interpret various visual prompts, including recognition, OCR, knowledge, math, relationship reasoning, and language generation. ViP-Bench includes a diverse set of 303 images and questions, providing a thorough assessment of visual understanding capabilities at the region level. This benchmark sets a foundation for future research into multimodal models with arbitrary visual prompts.
10 PAPERS • 1 BENCHMARK
Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles. However, their robustness against diverse style shifts, crucial for practical applications, remains largely unexplored. In this paper, we propose a new benchmark, BenchLMM, to assess the robustness of LMMs against three different styles: artistic image style, imaging sensor style, and application style, where each style has five sub-styles. Utilizing BenchLMM, we comprehensively evaluate state-of-the-art LMMs and reveal that: 1) LMMs generally suffer performance degradation when working with other styles; 2) an LMM performing better than another model on the common style does not guarantee its superior performance in other styles; 3) LMMs' reasoning capability can be enhanced by prompting them to predict the style first, based on which we propose a versatile and training-free method for improving LMMs; and 4) an intelligent LMM is expected to interpret the causes of its errors when failing to answer correctly.
9 PAPERS • 1 BENCHMARK
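A rough illustration of the "predict the style first" idea described for BenchLMM: the model is first asked to identify the image's style, and its own answer is prepended to the actual question. This is a hedged sketch only; `query_lmm` is a hypothetical wrapper around whatever multimodal model is being evaluated, not an API from the BenchLMM paper.

```python
# Minimal, training-free sketch of style-first prompting, assuming a
# hypothetical query_lmm(image_path, prompt) -> str wrapper.
from typing import Callable

def style_first_answer(image_path: str,
                       question: str,
                       query_lmm: Callable[[str, str], str]) -> str:
    # Step 1: ask the model to describe the image style (artistic, sensor, application, ...).
    style = query_lmm(image_path, "What style or domain is this image from? Answer briefly.")
    # Step 2: condition the real question on the model's own style prediction.
    prompt = f"This image appears to be in the following style: {style}.\n{question}"
    return query_lmm(image_path, prompt)
```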
The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests. It includes 300 literary essays and over 19,000 multiple-choice questions on a range of topics. The dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more by including both textual data and accompanying images. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and contrasted their performance with that of Vietnamese students. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education, though they still have room to grow in the remaining subjects.
7 PAPERS • 9 BENCHMARKS
SciGraphQA is a large-scale, open-domain dataset focused on generating multi-turn conversational question-answering dialogues centered around understanding and describing scientific graphs and figures. It contains over 300,000 samples derived from academic research papers in computer science and machine learning domains.
3 PAPERS • 1 BENCHMARK
UIT-EVJVQA is the first multilingual Visual Question Answering dataset covering three languages: English, Vietnamese, and Japanese. It includes question-answer pairs created by humans for a set of images taken in Vietnam, with each answer created from the input question and the corresponding image. UIT-EVJVQA consists of more than 33,000 question-answer pairs for evaluating multilingual QA (mQA) models.
3 PAPERS • NO BENCHMARKS YET
In recent years, visual question answering (VQA) has attracted attention from the research community because of its promising applications (such as virtual assistants in intelligent cars, assistive devices for blind people, and information retrieval from document images using natural language queries) and its challenging nature. The VQA task requires methods that can fuse information from questions and images to produce appropriate answers. Neural VQA models have achieved tremendous growth on large-scale datasets, which are mostly for resource-rich languages such as English. However, available datasets narrow the VQA task to answer selection or answer classification. We argue that this form of VQA is far from human ability and eliminates the challenge of the answering aspect of the task by selecting answers rather than generating them. In this paper, we introduce the OpenViVQA (Open-domain Vietnamese Visual Question Answering) dataset.
2 PAPERS • NO BENCHMARKS YET
GQA-OOD is a new dataset and benchmark for the evaluation of VQA models in out-of-distribution (OOD) settings.
1 PAPER • NO BENCHMARKS YET