The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for fine-grained visual categorization. It contains 11,788 images of 200 bird subcategories, 5,994 for training and 5,794 for testing. Each image has detailed annotations: 1 subcategory label, 15 part locations, 312 binary attributes, and 1 bounding box. The textual information comes from Reed et al., who extended CUB-200-2011 by collecting fine-grained natural language descriptions: ten single-sentence descriptions per image. The descriptions were collected through the Amazon Mechanical Turk (AMT) platform and are required to contain at least 10 words, without any information about subcategories or actions.
1,955 PAPERS • 44 BENCHMARKS
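As a quick illustration of how the official CUB-200-2011 split above is typically reproduced, the sketch below parses two metadata files (images.txt and train_test_split.txt) from the standard release; the file names, layout, and extraction directory are assumptions about the common distribution, not part of the description itself.

```python
from pathlib import Path

# Minimal sketch (assumed standard CUB-200-2011 release layout):
# images.txt           : "<image_id> <relative_image_path>" per line
# train_test_split.txt : "<image_id> <is_training_image>" per line (1 = train, 0 = test)
root = Path("CUB_200_2011")  # hypothetical extraction directory

paths = {}
with open(root / "images.txt") as f:
    for line in f:
        image_id, rel_path = line.split()
        paths[image_id] = rel_path

train, test = [], []
with open(root / "train_test_split.txt") as f:
    for line in f:
        image_id, is_train = line.split()
        (train if is_train == "1" else test).append(paths[image_id])

print(len(train), len(test))  # expected: 5994 and 5794, matching the counts above
```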
The CASIA-WebFace dataset is used for face verification and face identification tasks. The dataset contains 494,414 face images of 10,575 real identities collected from the web.
381 PAPERS • 2 BENCHMARKS
The Stanford Online Products (SOP) dataset has 22,634 classes with 120,053 product images. The first 11,318 classes (59,551 images) are used for training and the remaining 11,316 classes (60,502 images) for testing.
221 PAPERS • 5 BENCHMARKS
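The SOP split above is class-disjoint, i.e., no product class appears in both partitions, which is the standard protocol for metric learning on this dataset. Below is a minimal sketch of that partitioning logic, assuming a plain list of (class_id, image_path) pairs rather than any particular file format of the release.

```python
def split_sop(samples, n_train_classes=11_318):
    """Class-disjoint split: the first n_train_classes class ids go to train,
    all remaining classes go to test (no class is shared between the two sets)."""
    train, test = [], []
    for class_id, path in samples:
        (train if class_id <= n_train_classes else test).append((class_id, path))
    return train, test

# Hypothetical usage with (class_id, image_path) pairs:
# train, test = split_sop(samples)
# len(train) -> 59,551 images over 11,318 classes
# len(test)  -> 60,502 images over 11,316 classes
```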
The In-shop Clothes Retrieval Benchmark evaluates the performance of in-shop clothes retrieval. It is a large subset of DeepFashion with large pose and scale variations, as well as large diversity, large quantities, and rich annotations.
150 PAPERS • 2 BENCHMARKS
The Viewpoint Invariant Pedestrian Recognition (VIPeR) dataset includes 632 people and two outdoor cameras under different viewpoints and light conditions. Each person has one image per camera and each image has been scaled to be 128×48 pixels. It provides the pose angle of each person as 0° (front), 45°, 90° (right), 135°, and 180° (back).
132 PAPERS • NO BENCHMARKS YET
Structured3D is a large-scale photo-realistic dataset containing 3.5K house designs created by professional designers, with a variety of ground-truth 3D structure annotations and photo-realistic 2D renderings. The dataset consists of rendered images and corresponding ground-truth annotations (e.g., semantic, albedo, depth, surface normal, layout) under different lighting and furniture configurations.
65 PAPERS • 1 BENCHMARK
ABO is a large-scale dataset designed for material prediction and multi-view retrieval experiments. The dataset contains Blender renderings of 30 viewpoints for each of 7,953 3D objects, as well as camera intrinsics and extrinsics for each rendering.
46 PAPERS • NO BENCHMARKS YET
CARS196 is composed of 16,185 car images of 196 classes.
41 PAPERS • 4 BENCHMARKS
Veri-Wild is the largest vehicle re-identification dataset (as of CVPR 2019). It was captured by a large CCTV surveillance system consisting of 174 cameras over one month (30 × 24 h) under unconstrained scenarios, and comprises 416,314 vehicle images of 40,671 identities. Evaluation is split across three subsets, small, medium, and large, comprising 3,000, 5,000, and 10,000 identities respectively (in both probe and gallery sets).
38 PAPERS • 3 BENCHMARKS
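To make the probe/gallery protocol mentioned above concrete, here is a generic sketch of rank-k (CMC) evaluation as commonly used in vehicle re-identification. It is not the official Veri-Wild evaluation code, and it omits refinements such as camera-id filtering; the function and argument names are illustrative, and inputs are assumed to be NumPy arrays.

```python
import numpy as np

def cmc_rank_k(probe_feats, probe_ids, gallery_feats, gallery_ids, k=1):
    """Generic probe/gallery sketch: for each probe embedding, rank gallery
    embeddings by Euclidean distance and check whether a matching identity
    appears among the top-k results (CMC rank-k accuracy)."""
    hits = 0
    for feat, pid in zip(probe_feats, probe_ids):
        dists = np.linalg.norm(gallery_feats - feat, axis=1)  # distance to every gallery image
        top_k_ids = gallery_ids[np.argsort(dists)[:k]]        # identities of the k nearest
        hits += int(pid in top_k_ids)
    return hits / len(probe_ids)
```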
WildDeepfake is a dataset for real-world deepfake detection consisting of 7,314 face sequences extracted from 707 deepfake videos collected entirely from the internet. WildDeepfake is a small dataset that can be used, in addition to existing datasets, to develop more effective detectors against real-world deepfakes.
34 PAPERS • NO BENCHMARKS YET
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version contains approximately 5 million images, split into three sets: train, index, and test.
26 PAPERS • 4 BENCHMARKS
Consists of 20k English biomedical entity mentions from Reddit, expert-annotated with links to SNOMED CT, a widely used medical knowledge graph.
18 PAPERS • NO BENCHMARKS YET
Provides detailed, graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes that capture who is present in the clip, their emotional and physical attributes, their relationships (e.g., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details, and with reasons that give motivations for actions.
12 PAPERS • NO BENCHMARKS YET
The Airport dataset is a dataset for person re-identification which consists of 39,902 images and 9,651 identities across six cameras.
8 PAPERS • NO BENCHMARKS YET
Collected from the top 10 most popular clothing/wearable brand-name logos, captured in rich visual context.
7 PAPERS • 1 BENCHMARK
The Hotels-50K dataset consists of over 1 million images from 50,000 different hotels around the world. These images come from travel websites as well as the TraffickCam mobile application, which allows everyday travelers to submit images of their hotel rooms in order to help combat trafficking. The TraffickCam images are more visually similar to images from trafficking investigations than the images from travel websites.
6 PAPERS • NO BENCHMARKS YET
N-Digit MNIST is a multi-digit MNIST-like dataset.
5 PAPERS • NO BENCHMARKS YET
Goldfinch is a dataset for fine-grained recognition challenges. It contains a list of bird, butterfly, aircraft, and dog categories with relevant Google image search and Flickr search URLs. In addition, it also includes a set of active learning annotations on dog categories.
3 PAPERS • NO BENCHMARKS YET
Large Age-Gap (LAG) is a dataset for face verification. The dataset contains 3,828 images of 1,010 celebrities. For each identity, at least one child/young image and one adult/old image are present.
The OpeReid dataset is a person re-identification dataset that consists of 7,413 images of 200 persons.
The BirdVox-full-night dataset contains 6 audio recordings, each about ten hours in duration. These recordings come from ROBIN autonomous recording units placed near Ithaca, NY, USA during fall 2015. They were captured on the night of September 23rd, 2015, by six different sensors, originally numbered 1, 2, 3, 5, 7, and 10. Andrew Farnsworth used the Raven software to pinpoint every avian flight call in time and frequency, finding 35,402 flight calls in total. He estimates that about 25 different species of passerines (thrushes, warblers, and sparrows) are present in these recordings. Species are not labeled in BirdVox-full-night, but it is possible to tell thrushes apart from warblers and sparrows by looking at the center frequencies of their calls. The annotation process took 102 hours.
2 PAPERS • NO BENCHMARKS YET
DyML-Animal is based on animal images selected from ImageNet-5K [1]. Its labels follow biological taxonomy (class, order, family, genus, species) and are organized into three semantic levels: 611 "species" at the fine level, 47 categories corresponding to "order", "family", or "genus" at the middle level, and 5 "classes" at the coarse level. Some animals exhibit a contradiction between visual perception and biological taxonomy; e.g., the whale is a mammal but looks more similar to fish, and annotating whale images as mammal would confuse visual recognition. We therefore check for such contradictions in detail and intentionally leave those animals out.
2 PAPERS • 1 BENCHMARK
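To make the three-level annotation described above concrete, here is a minimal sketch of the label structure an image in DyML-Animal carries; the field names and example values are illustrative, not taken from the dataset files.

```python
from dataclasses import dataclass

@dataclass
class DyMLAnimalLabel:
    coarse: int  # one of the 5 "class"-level labels
    middle: int  # one of the 47 order/family/genus-level labels
    fine: int    # one of the 611 "species"-level labels

# Hypothetical example: a dynamic metric-learning model is expected to respect
# similarity at all three semantic scales simultaneously.
label = DyMLAnimalLabel(coarse=2, middle=13, fine=407)
```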
DyML-Product is derived from iMaterialist-2019, a hierarchical online product dataset. The original iMaterialist-2019 offers up to 4 levels of hierarchical annotations. We remove the coarsest level and maintain 3 levels for DyML-Product.
DyML-Vehicle merges two vehicle re-ID datasets, PKU VehicleID [1] and VERI-Wild [1]. Since these two datasets only have annotations at the identity (fine) level, we manually annotate each image with a "model" label (e.g., Toyota Camry, Honda Accord, Audi A4) and a "body type" label (e.g., car, SUV, microbus, pickup). Moreover, we label all taxi images as a novel testing class at the coarse level.
The Freiburg Spatial Relations dataset features 546 scenes, each containing two out of 25 household objects. The depicted spatial relations can roughly be described as on top, on top on the corner, inside, inside and inclined, next to, and inclined. The dataset contains the 25 object models as textured .obj and .dae files, a low-resolution .dae version for visualization in rviz, a scene description file containing the translation and rotation of the objects for each scene, a file with labels for each scene, the 15 splits used for cross-validation, and a bash script to convert the models to point clouds.
VideoForensicsHQ is a benchmark dataset for face video forgery detection, providing high quality visual manipulations. It is one of the first face video manipulation benchmark sets that also contains audio and thus complements existing datasets along a new challenging dimension. VideoForensicsHQ shows manipulations at much higher video quality and resolution, and shows manipulations that are provably much harder to detect by humans than videos in other datasets.
A View From Somewhere (AVFS) is a dataset of 638,180 face similarity judgments over 4,921 faces. Each judgment corresponds to the odd-one-out (i.e., least similar) face in a triplet of faces and is accompanied by both the identifier and the demographic attributes of the annotator who made the judgment.
1 PAPER • NO BENCHMARKS YET
HAM is a dataset for molecular graph partitioning. It contains coarse-grained (CG) mappings of 1,206 organic molecules with fewer than 25 heavy atoms. Each molecule was downloaded from the PubChem database as a SMILES string and hand-mapped, and each molecule was assigned to two annotators in order to compare human agreement between CG mappings. The completed annotations were reviewed by a third person to identify and remove unreasonable mappings (e.g., one-bead mappings) that did not follow the given guidelines. With 16% of the annotations removed, the current database averages 2 × (1 − 0.16) ≈ 1.68 annotations per molecule.
The Image-MusicEmotion-Matching-Net (IMEMNet) dataset is a dataset for continuous emotion-based image and music matching. It has over 140K image-music pairs.
The Live Comment Dataset is a large-scale dataset with 2,361 videos and 895,929 live comments that were written while the videos were streamed.
The data consists of 100,000 renderings each of the Bunny and Dragon objects from the Stanford 3D Scanning Repository. More objects may be added in the future, but only the Bunny and Dragon are used in the paper. Each object is rendered with a uniformly sampled illumination from a point on the 2-sphere, and a uniformly sampled 3D rotation. The true latent states are provided as NumPy arrays along with the images. The lighting is given as a 3-vector with unit norm, while the rotation is provided both as a quaternion and a 3x3 orthogonal matrix.
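Since the description above specifies the latent formats exactly (unit-norm lighting vector, rotation as quaternion and 3x3 orthogonal matrix), a short sanity-check sketch can illustrate how the provided NumPy arrays are expected to behave; the file names here are hypothetical, only the formats follow the description.

```python
import numpy as np

# Hypothetical file names; only the array formats are taken from the description.
lighting = np.load("bunny_lighting.npy")    # shape (N, 3): unit-norm illumination directions
rotations = np.load("bunny_rotations.npy")  # shape (N, 3, 3): 3D rotation matrices

# Lighting vectors should have unit norm.
assert np.allclose(np.linalg.norm(lighting, axis=1), 1.0, atol=1e-5)

# Each rotation matrix R should be orthogonal (R @ R.T = I) with determinant +1.
assert np.allclose(rotations @ rotations.transpose(0, 2, 1), np.eye(3), atol=1e-5)
assert np.allclose(np.linalg.det(rotations), 1.0, atol=1e-5)
```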
Tsinghua Dogs is a fine-grained classification dataset for dogs; over 65% of its images are collected from people's everyday lives. Each dog breed in the dataset contains at least 200 images and at most 7,449 images, roughly in proportion to the breed's frequency of occurrence in China, so it significantly increases per-breed diversity over existing datasets. Furthermore, Tsinghua Dogs provides bounding-box annotations of the dog's whole body and head in each image, which can be used for supervising the training of learning algorithms as well as for testing them.