The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for fine-grained visual categorization. It contains 11,788 images of birds from 200 subcategories, 5,994 for training and 5,794 for testing. Each image has detailed annotations: 1 subcategory label, 15 part locations, 312 binary attributes and 1 bounding box. The textual information comes from Reed et al., who expanded CUB-200-2011 by collecting fine-grained natural language descriptions: ten single-sentence descriptions per image, gathered through the Amazon Mechanical Turk (AMT) platform and required to contain at least 10 words without mentioning subcategories or actions.
1,955 PAPERS • 44 BENCHMARKS
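As a rough illustration of the CUB-200-2011 annotation layout described above, the sketch below reads the metadata files that ship with the standard archive using pandas; the file names follow the commonly distributed layout, but the local path is an assumption and should be checked against your copy of the dataset.

```python
# Minimal sketch: reading CUB-200-2011 annotation files with pandas.
# Assumes the standard archive layout (images.txt, train_test_split.txt, etc.);
# the root path below is hypothetical.
from pathlib import Path
import pandas as pd

root = Path("CUB_200_2011")  # hypothetical path to the extracted archive

images = pd.read_csv(root / "images.txt", sep=" ", names=["img_id", "filepath"])
labels = pd.read_csv(root / "image_class_labels.txt", sep=" ", names=["img_id", "class_id"])
split = pd.read_csv(root / "train_test_split.txt", sep=" ", names=["img_id", "is_train"])
boxes = pd.read_csv(root / "bounding_boxes.txt", sep=" ", names=["img_id", "x", "y", "w", "h"])

meta = images.merge(labels, on="img_id").merge(split, on="img_id").merge(boxes, on="img_id")
print(len(meta))                        # expected: 11,788 images
print(meta["is_train"].value_counts())  # expected split: 5,994 train / 5,794 test
```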
Science Question Answering (ScienceQA) is a new benchmark that consists of 21,208 multimodal multiple choice questions with diverse science topics and annotations of their answers with corresponding lectures and explanations. Out of the questions in ScienceQA, 10,332 (48.7%) have an image context, 10,220 (48.2%) have a text context, and 6,532 (30.8%) have both. Most questions are annotated with grounded lectures (83.9%) and detailed explanations (90.5%). The lecture and explanation provide general external knowledge and specific reasons, respectively, for arriving at the correct answer. To the best of our knowledge, ScienceQA is the first large-scale multimodal dataset that annotates lectures and explanations for the answers.
141 PAPERS • 1 BENCHMARK
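The context-type percentages quoted for ScienceQA follow directly from the stated counts; a quick arithmetic check using only the numbers given in the description:

```python
# Quick check of the ScienceQA context statistics quoted above.
total = 21208
for name, count in [("image context", 10332), ("text context", 10220), ("both", 6532)]:
    print(f"{name}: {count / total:.1%}")
# -> image context: 48.7%, text context: 48.2%, both: 30.8%
```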
The dataset contains single-shot videos taken from moving cameras in underwater environments. This first shard of the Marine Video Kit dataset is presented to serve video retrieval and other computer vision challenges. In addition to basic metadata statistics, it provides several insights based on low-level features as well as semantic annotations of selected keyframes. It comprises 1,379 videos ranging in length from 2 s to 4.95 min, with mean and median durations of 29.9 s and 25.4 s, respectively. The data were captured in 11 different regions and countries between 2011 and 2022.
7 PAPERS • 1 BENCHMARK
Clinical diagnosis of the eye is performed over multifarious data modalities including scalar clinical labels, vectorized biomarkers, two-dimensional fundus images, and three-dimensional Optical Coherence Tomography (OCT) scans. While the clinical labels, fundus images and OCT scans are instrumental measurements, the vectorized biomarkers are interpreted attributes from the other measurements. Clinical practitioners use all these data modalities for diagnosing and treating eye diseases like Diabetic Retinopathy (DR) or Diabetic Macular Edema (DME). Enabling usage of machine learning algorithms within the ophthalmic medical domain requires research into the relationships and interactions between these relevant data modalities. Existing datasets are limited in that: (i) they view the problem as disease prediction without assessing biomarkers, and (ii) they do not consider the explicit relationship among all four data modalities over the treatment period. In this paper, we introduce the O
4 PAPERS • NO BENCHMARKS YET
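As a rough sketch of how one multimodal ophthalmic sample of the kind described above could be organized in code; the field names, example labels, and array shapes are illustrative assumptions, not the dataset's actual schema:

```python
# Illustrative container for one multimodal ophthalmic sample.
# Field names, labels, and shapes are assumptions for exposition only.
from dataclasses import dataclass
import numpy as np

@dataclass
class EyeExamSample:
    clinical_labels: dict    # scalar clinical labels from instrumental measurements
    biomarkers: np.ndarray   # vectorized biomarkers interpreted from the measurements
    fundus: np.ndarray       # 2D fundus image, e.g. shape (H, W, 3)
    oct_volume: np.ndarray   # 3D OCT scan, e.g. shape (num_slices, H, W)
    visit_week: int          # position within the treatment period

sample = EyeExamSample(
    clinical_labels={"label_a": 0.8, "label_b": 255},  # hypothetical label names
    biomarkers=np.zeros(16),
    fundus=np.zeros((512, 512, 3), dtype=np.uint8),
    oct_volume=np.zeros((49, 496, 768), dtype=np.uint8),
    visit_week=0,
)
```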
Recent accelerations in multi-modal applications have been made possible by the plethora of image and text data available online. However, the scarcity of similar data in the medical field, specifically in histopathology, has halted similar progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering 1,087 hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate Quilt: a large-scale vision-language dataset consisting of 768,826 image and text pairs. Quilt was automatically curated using a mixture of models, including large language models, handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around 200K samples. We combine Quilt with datasets from other sources, including Twitter, research papers, and the internet in general, to create an even larger dataset.
Multimodal object recognition is still an emerging field. Thus, publicly available datasets are still rare and of small size. This dataset was developed to help fill this void and presents multimodal data for 63 objects with some visual and haptic ambiguity. The dataset contains visual, kinesthetic and tactile (audio/vibrations) data. To completely solve sensory ambiguity, sensory integration/fusion would be required. This report describes the creation and structure of the dataset. The first section explains the underlying approach used to capture the visual and haptic properties of the objects. The second section describes the technical aspects (experimental setup) needed for the collection of the data. The third section introduces the objects, while the final section describes the structure and content of the dataset.
1 PAPER • NO BENCHMARKS YET
Boombox is a multimodal dataset for visual reconstruction from acoustic vibrations. It involves dropping objects into a box and capturing the resulting images and vibrations, and is used for training ML systems that predict images from vibration.
Correlated Corrupted Dataset is an evaluation set consisting of realistic visible-infrared (V-I) corruptions that allows models' corruption robustness to be evaluated. Initially proposed for multimodal person re-identification, the dataset can also be used for the evaluation of V-I cross-modal approaches. Corruptions of the visible modality are the 20 corruptions proposed by Chen et al. in the "Benchmarks for Corruption Invariant Person Re-identification" paper. Corruptions of the infrared modality are the 19 corruptions introduced in our paper, which respect the infrared modality encoding. In practice, for co-located visible-infrared cameras, weather-related corruptions should, for example, affect each camera, and blur-related corruptions would likely occur in both the visible and the infrared camera. This dataset captures this aspect by modeling the correlations that may occur between the cameras of the two modalities.
We provide a custom synthetic bimodal dataset, called GeBiD, designed specifically for the comparison of the joint- and cross-generative capabilities of Multimodal Variational Autoencoders. It comprises RGB images of geometric primitives and textual descriptions. The dataset offers 5 levels of difficulty (based on the number of attributes) to find the minimal functioning scenario for each model. Moreover, its rigid structure enables automatic qualitative evaluation of the generated samples.
A dataset for multimodal skill assessment, focusing on assessing a piano player's skill level. Annotations include the player's skill level and the song difficulty level. Bounding-box annotations around the pianists' hands are also provided.
1 PAPER • 3 BENCHMARKS
Recent advances in large language models have led to the development of multimodal LLMs (MLLMs), which take both image data and text as an input. Virtually all of these models have been announced within the past year, leading to a significant need for benchmarks evaluating the abilities of these models to reason truthfully and accurately on a diverse set of tasks. When Google announced Gemini (Gemini Team et al., 2023), they showcased its ability to solve rebuses: wordplay puzzles which involve creatively adding and subtracting letters from words derived from text and images. The diversity of rebuses allows for a broad evaluation of multimodal reasoning capabilities, including image recognition, multi-step reasoning, and understanding the human creator's intent. We present REBUS: a collection of 333 hand-crafted rebuses spanning 13 diverse categories, including hand-drawn and digital images created by nine contributors. Samples are presented in Table 1. Notably, GPT-4V, the most powe
1 PAPER • 1 BENCHMARK
Uncorrelated Corrupted Dataset is an evaluation set consisting of realistic visible-infrared (V-I) corruptions that allows models' corruption robustness to be evaluated. Initially proposed for multimodal person re-identification, the dataset can also be used for the evaluation of V-I cross-modal approaches. Corruptions of the visible modality are the 20 corruptions proposed by Chen et al. in the "Benchmarks for Corruption Invariant Person Re-identification" paper. Corruptions of the infrared modality are the 19 corruptions introduced in our paper, which respect the infrared modality encoding. In practice, the corruptions are applied randomly and independently to the visible and the infrared cameras, making the dataset better suited to a setting in which the cameras are not co-located.
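The difference between the Correlated and Uncorrelated Corrupted Datasets described above comes down to how corruptions are sampled across the two modalities. A minimal sketch of the two sampling schemes follows; the corruption names are placeholders, not the actual transforms from either paper:

```python
# Minimal sketch contrasting correlated vs. independent corruption sampling
# across visible (V) and infrared (I) modalities. Corruption names and the
# pairing rule are placeholders for exposition, not the papers' actual transforms.
import random

VISIBLE_CORRUPTIONS = [f"v_corruption_{i}" for i in range(20)]   # 20 visible corruptions
INFRARED_CORRUPTIONS = [f"i_corruption_{i}" for i in range(19)]  # 19 infrared corruptions

def corrupt_pair(correlated: bool):
    if correlated:
        # Co-located cameras: pick one shared corruption index so that both
        # modalities are degraded consistently (e.g. the same weather or blur).
        idx = random.randrange(min(len(VISIBLE_CORRUPTIONS), len(INFRARED_CORRUPTIONS)))
        return VISIBLE_CORRUPTIONS[idx], INFRARED_CORRUPTIONS[idx]
    # Not co-located: sample each modality's corruption independently.
    return random.choice(VISIBLE_CORRUPTIONS), random.choice(INFRARED_CORRUPTIONS)

print(corrupt_pair(correlated=True))
print(corrupt_pair(correlated=False))
```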
WebLI (Web Language Image) is a web-scale multilingual image-text dataset designed to support Google's vision-language research, such as large-scale pre-training for image understanding, image captioning, visual question answering, and object detection.
This dataset endeavors to fill the research void by presenting a meticulously curated collection of misogynistic memes in a code-mixed language of Hindi and English. It introduces two sub-tasks: the first entails a binary classification to determine the presence of misogyny in a meme, while the second task involves categorizing the misogynistic memes into multiple labels, including Objectification, Prejudice, and Humiliation.
0 PAPERS • NO BENCHMARKS YET
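To make the two sub-task label formats of the misogynistic-meme dataset concrete, here is a small sketch using scikit-learn's MultiLabelBinarizer; the category names come from the description above, while the example annotations are made up:

```python
# Sketch of the two label formats: binary misogyny detection (sub-task 1)
# and multi-label categorization (sub-task 2). Example annotations are made up.
from sklearn.preprocessing import MultiLabelBinarizer

# Sub-task 1: one binary label per meme (1 = misogynistic, 0 = not).
binary_labels = [1, 0, 1]

# Sub-task 2: zero or more categories per meme.
multi_labels = [["Objectification", "Humiliation"], [], ["Prejudice"]]
mlb = MultiLabelBinarizer(classes=["Objectification", "Prejudice", "Humiliation"])
print(mlb.fit_transform(multi_labels))
# -> [[1 0 1]
#     [0 0 0]
#     [0 1 0]]
```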
Mudestreda is a multimodal device state recognition dataset obtained from a real industrial milling device, containing time series and image data for classification, regression, anomaly detection, remaining useful life (RUL) estimation, signal drift measurement, zero-shot flank tool wear, and feature engineering purposes.
Facial landmark detection is a cornerstone in many facial analysis tasks such as face recognition, drowsiness detection, and facial expression recognition. Numerous methodologies have been introduced to achieve accurate and efficient facial landmark localization in visual images. However, only a few works address facial landmark detection in thermal images, the main challenge being the limited number of annotated datasets. In this work, we present a thermal face dataset with annotated face bounding boxes and facial landmarks. The dataset contains 2,556 thermal images of 142 individuals, where each thermal image is paired with the corresponding visual image. To the best of our knowledge, our dataset is the largest in terms of the number of individuals. In addition, our dataset can be employed for tasks such as thermal-to-visual image translation, thermal-visual face recognition, and others. We trained two models for the facial landmark detection task to show the efficacy of our
Human activity recognition and clinical biomechanics are challenging problems in physical telerehabilitation medicine. However, most publicly available datasets on human body movements cannot be used to study both problems in an out-of-the-lab movement acquisition setting. The objective of the VIDIMU dataset is to pave the way towards affordable patient tracking solutions for the remote recognition of daily-life activities and kinematic analysis.