The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
10,147 PAPERS • 92 BENCHMARKS
The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators.
731 PAPERS • 9 BENCHMARKS
Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content of an image. Up to this point, the resource most used for this task was the MS-COCO dataset, containing around 120,000 images and 5-way image-caption annotations (produced by paid annotators).
312 PAPERS • 2 BENCHMARKS
COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
177 PAPERS • 4 BENCHMARKS
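As a concrete illustration of the annotation format, the sketch below reads the caption annotations with the `pycocotools` COCO API and prints the five reference captions for one image. The annotation-file path is illustrative and depends on where the downloaded split is unpacked.

```python
# Minimal sketch: reading COCO Captions with the pycocotools COCO API.
# The annotation path below is an assumption about where the files were unpacked.
from pycocotools.coco import COCO

coco_caps = COCO("annotations/captions_val2017.json")

# Pick an arbitrary image and fetch all caption annotations attached to it.
image_id = coco_caps.getImgIds()[0]
ann_ids = coco_caps.getAnnIds(imgIds=[image_id])
anns = coco_caps.loadAnns(ann_ids)

# Each train/val image carries five independent human-written captions.
for ann in anns:
    print(ann["caption"])
```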
The Hateful Memes data set is a multimodal dataset for hateful meme detection (image + text) that contains 10,000+ new multimodal examples created by Facebook AI. Images were licensed from Getty Images so that researchers can use the data set to support their work.
142 PAPERS • 1 BENCHMARK
The nocaps benchmark consists of 166,100 human-generated captions describing 15,100 images from the OpenImages validation and test sets.
141 PAPERS • 13 BENCHMARKS
The VizWiz-VQA dataset originates from a natural visual question answering setting where blind people each took an image and recorded a spoken question about it, together with 10 crowdsourced answers per visual question. The proposed challenge addresses two tasks for this dataset: (1) predict the answer to a visual question and (2) predict whether a visual question cannot be answered.
140 PAPERS • 7 BENCHMARKS
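Because each VizWiz question comes with 10 crowdsourced answers, answer prediction is typically scored with the standard VQA consensus metric: a prediction earns min(#matching human answers / 3, 1), averaged over all 9-answer subsets of the 10 answers. The sketch below implements that formula; answer normalization (lowercasing, punctuation stripping) used by the official evaluation is omitted here.

```python
from itertools import combinations

def vqa_accuracy(prediction, human_answers):
    """Consensus accuracy used by VQA-style benchmarks such as VizWiz-VQA.

    The prediction is compared against every 9-answer subset of the 10
    crowdsourced answers; within each subset it scores
    min(#matching answers / 3, 1), and the subset scores are averaged.
    Answer normalization is intentionally omitted in this sketch.
    """
    scores = []
    for subset in combinations(human_answers, len(human_answers) - 1):
        matches = sum(ans == prediction for ans in subset)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)

# Example: 7 of the 10 annotators agree with the prediction.
print(vqa_accuracy("yes", ["yes"] * 7 + ["no"] * 3))
```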
The ReferIt dataset contains 130,525 expressions for referring to 96,654 objects in 19,894 images of natural scenes.
69 PAPERS • NO BENCHMARKS YET
Contains 145k captions for 28k images. The dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects.
64 PAPERS • 1 BENCHMARK
Winoground is a dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning. Given two images and two captions, the goal is to match them correctly -- but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance.
57 PAPERS • 1 BENCHMARK
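Winoground is typically scored with three pairwise metrics derived from a model's image-text matching score: a text score, an image score, and a group score. The sketch below computes them for a single example; `sim(caption, image)` stands for any image-text similarity function (e.g. a CLIP score) and is an assumed callable, not part of the dataset release.

```python
# Minimal sketch of the Winoground text/image/group scores for one example.
# `sim(caption, image)` is an assumed similarity function supplied by the user.
def winoground_scores(sim, c0, c1, i0, i1):
    # Text score: given each image, the model must prefer its own caption.
    text_ok = sim(c0, i0) > sim(c1, i0) and sim(c1, i1) > sim(c0, i1)
    # Image score: given each caption, the model must prefer its own image.
    image_ok = sim(c0, i0) > sim(c0, i1) and sim(c1, i1) > sim(c1, i0)
    # Group score: both conditions must hold simultaneously.
    return text_ok, image_ok, text_ok and image_ok
```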
We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.
53 PAPERS • 5 BENCHMARKS
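Each Localized Narratives annotation pairs the transcribed caption with word-level timing and the synchronized mouse trace. The sketch below shows how one record could be read; the JSON Lines layout and the field names (`caption`, `timed_caption`, `traces`) follow my recollection of the released files and should be verified against the official documentation, and the file path is illustrative.

```python
import json

# Sketch, assuming a JSON Lines annotation file with fields `caption`,
# `timed_caption` ({utterance, start_time, end_time}) and `traces`
# (segments of {x, y, t} mouse points). Field names and file name are
# assumptions based on the public release, not guaranteed.
with open("localized_narratives.jsonl") as f:
    record = json.loads(f.readline())

print(record["caption"])
for word in record["timed_caption"]:
    print(word["utterance"], word["start_time"], word["end_time"])

# Each trace segment is a sequence of normalized mouse positions over time.
first_segment = record["traces"][0]
print(len(first_segment), first_segment[0])
```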
A new challenge set for multimodal classification, focusing on detecting hate speech in multimodal memes.
41 PAPERS • NO BENCHMARKS YET
The Remote Sensing Image Captioning Dataset (RSICD) is a dataset for the remote sensing image captioning task. It contains more than ten thousand remote sensing images collected from Google Earth, Baidu Map, MapABC and Tianditu. The images are fixed to 224×224 pixels with various resolutions. The total number of remote sensing images is 10,921, with five sentence descriptions per image.
41 PAPERS • 3 BENCHMARKS
A new large-scale dataset for referring expressions, based on MS-COCO.
30 PAPERS • 4 BENCHMARKS
The Image Paragraph Captioning dataset allows researchers to benchmark their progress in generating paragraphs that tell a story about an image. The dataset contains 19,561 images from the Visual Genome dataset, each paired with one paragraph. The training/val/test splits contain 14,575/2,487/2,489 images.
30 PAPERS • 2 BENCHMARKS
UT Zappos50K is a large shoe dataset consisting of 50,025 catalog images collected from Zappos.com. The images are divided into 4 major categories — shoes, sandals, slippers, and boots — followed by functional types and individual brands. The shoes are centered on a white background and pictured in the same orientation for convenient analysis.
30 PAPERS • 1 BENCHMARK
The SentiCap dataset contains several thousand images with captions expressing positive and negative sentiments. These sentiment-bearing captions were constructed by the authors by rewriting factual descriptions. In total there are more than 2,000 such captions.
26 PAPERS • NO BENCHMARKS YET
The dataset contains 33,010 molecule-description pairs split 80%/10%/10% into train/val/test sets. The goal of the task is to retrieve the relevant molecule for a given natural language description.
22 PAPERS • 4 BENCHMARKS
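Since this is a cross-modal retrieval task, a common evaluation is to embed descriptions and molecules into a shared space, rank all molecules by cosine similarity against each description, and report mean reciprocal rank or Hits@k. The sketch below assumes such embeddings already exist (`text_emb` and `mol_emb` are placeholders, not part of the dataset) and computes MRR with NumPy.

```python
import numpy as np

# Sketch of ranking-based evaluation for description-to-molecule retrieval.
# `text_emb` and `mol_emb` are assumed, L2-normalized embedding matrices of
# shape (N, d), where row i of each matrix corresponds to the same pair.
def mean_reciprocal_rank(text_emb, mol_emb):
    sims = text_emb @ mol_emb.T            # (N, N) cosine similarities
    reciprocal_ranks = []
    for i, row in enumerate(sims):
        # Rank of the true molecule among all candidates (1 = best).
        rank = 1 + np.sum(row > row[i])
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

# Toy example with random embeddings, just to show the call shape.
rng = np.random.default_rng(0)
e = rng.normal(size=(5, 16))
e /= np.linalg.norm(e, axis=1, keepdims=True)
print(mean_reciprocal_rank(e, e))          # identical embeddings -> MRR of 1.0
```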
FlickrStyle10K is built on the Flickr30K image caption dataset. The original FlickrStyle10K dataset has 10,000 pairs of images and stylized captions in humorous and romantic styles. However, only 7,000 pairs from the official training set are now publicly accessible. The dataset can be downloaded via https://zhegan27.github.io/Papers/FlickrStyle_v0.9.zip
22 PAPERS • 2 BENCHMARKS
COCO-CN is a bilingual image description dataset enriching MS-COCO with manually written Chinese sentences and tags. The new dataset can be used for multiple tasks including image tagging, captioning and retrieval, all in a cross-lingual setting.
20 PAPERS • 3 BENCHMARKS
WIDER is a dataset for complex event recognition from static images. As of v0.1, it contains 61 event categories and around 50,574 images annotated with event class labels.
20 PAPERS • 1 BENCHMARK
Consists of over 39,000 images taken by people who are blind, each paired with five captions.
12 PAPERS • NO BENCHMARKS YET
IU X-ray (Demner-Fushman et al., 2016) is a set of chest X-ray images paired with their corresponding diagnostic reports. The dataset contains 7,470 pairs of images and reports.
11 PAPERS • 1 BENCHMARK
A collection that allows researchers to approach the extremely challenging problem of description generation using relatively simple non-parametric methods, which produce surprisingly effective results.
10 PAPERS • NO BENCHMARKS YET
SCICAP is a large-scale image captioning dataset that contains real-world scientific figures and captions. SCICAP was constructed using more than two million images from over 290,000 papers collected and released by arXiv.
10 PAPERS • 1 BENCHMARK
DIOR-RSVG is a large-scale benchmark dataset for visual grounding in remote sensing (RSVG). It aims to localize objects referred to by natural language expressions in remote sensing (RS) images. The dataset includes image/expression/box triplets for training and evaluating visual grounding models.
7 PAPERS • NO BENCHMARKS YET
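Visual grounding benchmarks such as DIOR-RSVG and ReferIt are commonly evaluated by counting a prediction as correct when its box overlaps the ground-truth box with IoU of at least 0.5 (often reported as Acc@0.5). The sketch below implements that check for boxes given as (x1, y1, x2, y2); it is a generic illustration, not the official evaluation script.

```python
# Minimal sketch of the IoU-based accuracy commonly used for visual grounding:
# a prediction counts as correct when IoU >= 0.5. Boxes are (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions, ground_truths, threshold=0.5):
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Toy example: one near-perfect box and one clear miss -> accuracy 0.5.
print(grounding_accuracy([(10, 10, 50, 50), (0, 0, 5, 5)],
                         [(12, 12, 50, 50), (100, 100, 120, 120)]))
```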
This dataset is a collection of spoken language instructions for a robotic system to pick and place common objects. Text instructions and corresponding object images are provided. The dataset consists of situations where the robot is instructed by the operator to pick up a specific object and move it to another location: for example, "Move the blue and white tissue box to the top right bin." This dataset consists of RGBD images, bounding box annotations, destination box annotations, and text instructions.
CITE is a crowd-sourced resource for multimodal discourse: this resource characterises inferences in image-text contexts in the domain of cooking recipes in the form of coherence relations.
6 PAPERS • 1 BENCHMARK
This dataset consists of images and annotations in Bengali. The images were annotated in Bengali by two adult native Bengali speakers. All popular image captioning datasets have a predominant Western cultural bias, with annotations done in English. Using such datasets to train an image captioning system assumes that a good English-to-target-language translation system exists and that the original dataset contained elements of the target culture. Both assumptions are false, leading to the need for a culturally relevant dataset in Bengali that supports generating appropriate captions for images relevant to the Bangladeshi and wider subcontinental context. The dataset presented consists of 9,154 images.
5 PAPERS • 1 BENCHMARK
PoMo consists of more than 231K sentences with post-modifiers and associated facts extracted from Wikidata for around 57K unique entities.
5 PAPERS • NO BENCHMARKS YET
Tencent ML-Images is a large open-source multi-label image database, including 17,609,752 training and 88,739 validation image URLs, which are annotated with up to 11,166 categories.
UIT-ViIC contains manually written captions for images from the Microsoft COCO dataset relating to sports played with a ball. UIT-ViIC consists of 19,250 Vietnamese captions for 3,850 images.
Peir Gross (Jing et al., 2018) was collected from the Gross sub-collection of the PEIR digital library, resulting in 7,442 image-caption pairs from 21 different sub-categories. Each caption contains only one sentence.
4 PAPERS • 1 BENCHMARK
STAIR Captions is a large-scale dataset containing 820,310 Japanese captions. This dataset can be used for caption generation, multimodal retrieval, and image generation.
4 PAPERS • NO BENCHMARKS YET
Concadia is a publicly available Wikipedia-based corpus, which consists of 96,918 images with corresponding English-language descriptions, captions, and surrounding context.
3 PAPERS • 1 BENCHMARK
Contains 8k Flickr images with captions.
A new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions.
3 PAPERS • NO BENCHMARKS YET
The Image and Video Advertisements collection consists of an image dataset of 64,832 image ads, and a video dataset of 3,477 ads. The data contains rich annotations encompassing the topic and sentiment of the ads, questions and answers describing what actions the viewer is prompted to take and the reasoning that the ad presents to persuade the viewer ("What should I do according to this ad, and why should I do it?"), and symbolic references ads make (e.g. a dove symbolizes peace).
LAION-COCO is the world's largest dataset of 600M generated high-quality captions for publicly available web images. The images come from the English subset of LAION-5B, and the captions were generated with an ensemble of BLIP L/14 and two CLIP versions (L/14 and RN50x64). This dataset allows models to produce high-quality captions for images.
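The generate-then-rank recipe described above can be approximated with off-the-shelf Hugging Face models: sample several candidate captions with a BLIP captioner, then keep the one a CLIP model scores highest against the image. The sketch below is only an approximation of that pipeline; the checkpoints and generation settings are assumptions, not the exact configuration used to build LAION-COCO.

```python
# Rough sketch of a generate-then-rank captioning pipeline in the spirit of
# LAION-COCO: BLIP proposes candidate captions, CLIP picks the best match.
# Model checkpoints, sampling settings, and the image path are assumptions.
import torch
from PIL import Image
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          CLIPProcessor, CLIPModel)

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.jpg")  # illustrative path

# 1) Sample several candidate captions with BLIP.
inputs = blip_proc(images=image, return_tensors="pt")
candidate_ids = blip.generate(**inputs, do_sample=True,
                              num_return_sequences=8, max_new_tokens=30)
candidates = [blip_proc.decode(ids, skip_special_tokens=True)
              for ids in candidate_ids]

# 2) Rank candidates by CLIP image-text similarity and keep the best one.
clip_inputs = clip_proc(text=candidates, images=image,
                        return_tensors="pt", padding=True)
with torch.no_grad():
    scores = clip(**clip_inputs).logits_per_image[0]
print(candidates[scores.argmax().item()])
```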
Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. A subset of 1.9M images includes diverse annotation types.
This is an open-source image captions dataset for the aesthetic evaluation of images. The dataset, called DPC-Captions, contains comments on up to five aesthetic attributes per image, obtained through knowledge transfer from a fully annotated small-scale dataset.
2 PAPERS • NO BENCHMARKS YET
DeCOCO is a bilingual (English-German) corpus of image descriptions, where the English part is extracted from the COCO dataset and the German part consists of translations by a native German speaker.
Egoshots is a 2-month ego-vision dataset captured with an Autographer wearable camera and annotated "for free" via transfer learning, using three state-of-the-art pre-trained image captioning models. The dataset represents the life of two interns working at Philips Research (Netherlands) from May to July 2015, who generously donated their data.
Hephaestus is the first large-scale InSAR dataset. Driven by volcanic unrest detection, it provides 19,919 unique satellite frames annotated with a diverse set of labels, and each sample is accompanied by a textual description of its contents. The goal of this dataset is to boost research on the exploitation of interferometric data, enabling the application of state-of-the-art computer vision and NLP methods. Furthermore, the annotated dataset is bundled with a large archive of unlabeled frames to enable large-scale self-supervised learning. The final size of the dataset amounts to 110,573 interferograms.
VizWiz-Priv includes 8,862 regions showing private content across 5,537 images taken by blind people. Of these, 1,403 are paired with questions and 62% of those directly ask about the private content.
A large-scale dataset that links the assessment of image quality issues to two practical vision tasks: image captioning and visual question answering.
WHOOPS! is a dataset and benchmark for visual commonsense. It comprises purposefully commonsense-defying images created by designers using publicly available image generation tools such as Midjourney. The images defy commonsense for a wide range of reasons, including deviations from expected social norms and everyday knowledge.
2 PAPERS • 4 BENCHMARKS