The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
10,147 PAPERS • 92 BENCHMARKS
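COCO annotations (detection, keypoints, and captions) are typically accessed through the pycocotools API. Below is a minimal sketch for reading the caption annotations; the local path to the annotation file is an assumption and should be adjusted to where the files were downloaded.

```python
# Minimal sketch: load MS COCO caption annotations with pycocotools.
# The annotation file path is an assumption; point it at your local copy.
from pycocotools.coco import COCO

coco_caps = COCO("annotations/captions_train2017.json")

img_id = coco_caps.getImgIds()[0]               # pick the first image id
ann_ids = coco_caps.getAnnIds(imgIds=[img_id])  # caption annotation ids for that image
for ann in coco_caps.loadAnns(ann_ids):
    print(ann["caption"])                       # roughly 5 reference captions per image
```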
The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for the fine-grained visual categorization task. It contains 11,788 images of 200 bird subcategories, 5,994 for training and 5,794 for testing. Each image has detailed annotations: 1 subcategory label, 15 part locations, 312 binary attributes, and 1 bounding box. The textual information comes from Reed et al., who expanded the CUB-200-2011 dataset by collecting fine-grained natural-language descriptions: ten single-sentence descriptions per image. The descriptions were collected through the Amazon Mechanical Turk (AMT) platform and are required to contain at least 10 words, without mentioning subcategories or actions.
1,955 PAPERS • 44 BENCHMARKS
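The per-image annotations ship as plain-text metadata files in the official CUB_200_2011 release. A minimal parsing sketch is shown below; the file names follow that release, while the root path and the helper are assumptions for illustration.

```python
# Minimal sketch: parse CUB-200-2011 metadata files (names per the official release).
from pathlib import Path

root = Path("CUB_200_2011")  # assumed extraction directory

def read_table(name):
    # Each file is whitespace-separated with the image id in the first column.
    with open(root / name) as f:
        return {line.split()[0]: line.split()[1:] for line in f}

images = read_table("images.txt")              # image_id -> [relative image path]
labels = read_table("image_class_labels.txt")  # image_id -> [class id, 1..200]
split = read_table("train_test_split.txt")     # image_id -> ["1" if training else "0"]

train_ids = [i for i, flag in split.items() if flag[0] == "1"]
print(len(train_ids), "training images")       # should report 5,994
```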
Oxford 102 Flower is an image classification dataset consisting of 102 flower categories. The flowers chosen are those commonly occurring in the United Kingdom. Each class consists of between 40 and 258 images.
1,044 PAPERS • 14 BENCHMARKS
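A minimal loading sketch, assuming torchvision >= 0.12, which ships an Oxford 102 Flower loader as torchvision.datasets.Flowers102 (the download root is an assumption).

```python
# Minimal sketch: load Oxford 102 Flower via torchvision.
from torchvision import datasets, transforms

flowers = datasets.Flowers102(
    root="data",                      # download location (assumption)
    split="train",                    # "train", "val", or "test"
    transform=transforms.ToTensor(),
    download=True,
)
image, label = flowers[0]             # label is an integer in [0, 101]
print(image.shape, label)
```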
Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content of an image. Up to this point, the resource most used for this task was the MS-COCO dataset, containing around 120,000 images and 5-way image-caption annotations (produced by paid annotators).
312 PAPERS • 2 BENCHMARKS
Fashion-Gen consists of 293,008 high definition (1360 x 1360 pixels) fashion images paired with item descriptions provided by professional stylists. Each item is photographed from a variety of angles.
30 PAPERS • NO BENCHMARKS YET
Multi-Modal-CelebA-HQ is a large-scale face image dataset of 30,000 high-resolution face images selected from the CelebA dataset by following CelebA-HQ. Each image is paired with a high-quality segmentation mask, a sketch, a descriptive text, and a version of the image with a transparent background.
27 PAPERS • 3 BENCHMARKS
A dataset of 90,000 high-resolution nature landscape images, crawled from Unsplash and Flickr and preprocessed with Mask R-CNN and Inception V3.
24 PAPERS • 4 BENCHMARKS
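The entry above mentions preprocessing with Mask R-CNN and Inception V3 but does not specify the pipeline, so the following is only an illustrative sketch of such a Mask R-CNN pass using torchvision's pretrained model (torchvision >= 0.13 assumed; the image file name is hypothetical).

```python
# Illustrative sketch: run torchvision's pretrained Mask R-CNN on one image.
# This is an assumption about the kind of preprocessing mentioned, not the
# dataset's actual pipeline.
import torch
from torchvision import transforms
from torchvision.models.detection import maskrcnn_resnet50_fpn
from PIL import Image

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

img = transforms.ToTensor()(Image.open("landscape.jpg").convert("RGB"))  # hypothetical file
with torch.no_grad():
    pred = model([img])[0]            # dict with boxes, labels, scores, masks
keep = pred["scores"] > 0.5           # e.g. keep only confident detections
print(int(keep.sum()), "objects detected")
```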
The Pick-a-Pic dataset was created by logging user interactions with the Pick-a-Pic web application for text-to-image generation. Overall, the dataset contains over 500,000 examples and 35,000 distinct prompts. Each example contains a prompt, two generated images, and a label indicating which image is preferred, or a tie when neither image is significantly preferred over the other.
15 PAPERS • NO BENCHMARKS YET
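A minimal sketch of one Pick-a-Pic example as described above; the class and field names here are hypothetical and only mirror the prose, not the dataset's actual schema.

```python
# Hypothetical record structure for a Pick-a-Pic example (names are assumptions).
from dataclasses import dataclass
from typing import Optional

@dataclass
class PickAPicExample:
    prompt: str                 # the text-to-image prompt
    image_a: bytes              # first generated image
    image_b: bytes              # second generated image
    preferred: Optional[str]    # "a", "b", or None for a tie

example = PickAPicExample("a watercolor fox", b"...", b"...", preferred=None)
print("tie" if example.preferred is None else f"image {example.preferred} preferred")
```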
T2I-CompBench is a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional textual prompts from 3 categories (attribute binding, object relationships, and complex compositions) and 6 sub-categories (color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions).
12 PAPERS • NO BENCHMARKS YET
Human-Art is a versatile human-centric dataset designed to bridge the gap between natural and artificial scenes. It covers 20 high-quality scenarios, spanning natural and artificial humans in both 2D and 3D representations, with 50,000 images containing more than 123,000 human figures. Annotations include human bounding boxes, 21 2D human keypoints, human self-contact keypoints, and descriptive text.
6 PAPERS • 1 BENCHMARK
LAION-COCO is the world's largest dataset of generated high-quality captions for publicly available web images, with 600M captions. The captions are produced for images from the English subset of LAION-5B using an ensemble of BLIP L/14 and two CLIP versions (L/14 and RN50x64). The dataset allows models to be trained to produce high-quality captions for images.
3 PAPERS • 1 BENCHMARK
HRS-Bench is a concrete evaluation benchmark for T2I models that is Holistic, Reliable, and Scalable. It measures 13 skills that can be categorized into five major categories: accuracy, robustness, generalization, fairness, and bias. In addition, HRS-Bench covers 50 scenarios, including fashion, animals, transportation, food, and clothes.
2 PAPERS • NO BENCHMARKS YET
Paper2Fig100k is a dataset with over 100k images of figures and texts from research papers. The figures show architecture diagrams and methodologies of articles available at arXiv.org from fields like artificial intelligence and computer vision. Figures usually include text and discrete objects, e.g., boxes in a diagram, with lines and arrows that connect them.
A large dataset of color names and their respective RGB values, stored in CSV format.
1 PAPER • 1 BENCHMARK
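A minimal sketch of reading such a CSV with pandas; the file name and the column headers ("name", "red", "green", "blue") are assumptions about the layout.

```python
# Minimal sketch: build a color-name -> RGB lookup from an assumed CSV layout.
import pandas as pd

colors = pd.read_csv("color_names.csv")  # hypothetical file name
lookup = {
    row["name"].lower(): (row["red"], row["green"], row["blue"])
    for _, row in colors.iterrows()
}
print(lookup.get("teal"))   # e.g. (0, 128, 128) if that entry is present
```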
ENTIGEN is a benchmark dataset to evaluate the change in image generations conditional on ethical interventions across three social axes -- gender, skin color, and culture. It contains 246 prompts based on an attribute set containing diverse professions, objects, and cultural scenarios.
1 PAPER • NO BENCHMARKS YET
We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in one of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents image-text classification baselines showing that the dataset poses a challenging fine-grained classification problem: the best-scoring EmbraceNet model, using both visual and textual features, achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text.
A Filipino multi-modal language dataset for image-conditional language generation and text-conditional image generation. Consists of 351,755 Filipino news articles gathered from Filipino news outlets.
0 PAPER • NO BENCHMARKS YET
We construct a fine-grained video-text dataset with 12K annotated high-resolution videos (~400k clips). The annotation of this dataset is inspired by video scripts: to make a video, one first writes a script that organizes how the scenes will be shot, and shooting a scene requires deciding its content, the shot type (medium shot, close-up, etc.), and how the camera moves (panning, tilting, etc.). We therefore extend video captioning to video scripting by annotating videos in the format of video scripts. Unlike previous video-text datasets, we densely annotate entire videos without discarding any scenes, and each scene has a caption of ~145 words. Besides the vision modality, we transcribe the voice-over into text and provide it along with the video title to give more background information for annotating the videos.
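A minimal sketch of a scene-level script annotation as described above; the class and field names are hypothetical and only mirror the prose (dense caption, shot type, camera movement, voice-over transcript, video title).

```python
# Hypothetical structure for a video-script annotation (names are assumptions).
from dataclasses import dataclass, field
from typing import List

@dataclass
class SceneScript:
    caption: str          # dense scene description, ~145 words
    shot_type: str        # e.g. "medium shot", "close-up"
    camera_movement: str  # e.g. "panning", "tilting"

@dataclass
class VideoScript:
    title: str
    voice_over: str                 # transcribed voice-over text
    scenes: List[SceneScript] = field(default_factory=list)

clip = VideoScript(title="Example video", voice_over="...", scenes=[
    SceneScript("A hiker crests a ridge at sunrise ...", "medium shot", "panning"),
])
print(len(clip.scenes), "annotated scenes")
```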