The CIFAR-10 dataset (Canadian Institute for Advanced Research, 10 classes) is a subset of the Tiny Images dataset and consists of 60,000 32×32 color images. The images are labelled with one of 10 mutually exclusive classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck; neither the automobile nor the truck class includes pickup trucks. There are 6,000 images per class, with 5,000 training and 1,000 testing images per class.
14,087 PAPERS • 98 BENCHMARKS
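A minimal sketch of the splits described above, using torchvision's built-in CIFAR10 loader (assuming torchvision is installed; the root path is illustrative):

```python
# Sketch: load CIFAR-10 with torchvision and check the splits described above.
from torchvision import datasets

train = datasets.CIFAR10(root="./data", train=True, download=True)
test = datasets.CIFAR10(root="./data", train=False, download=True)

print(len(train), len(test))  # 50000 10000  (5,000 / 1,000 per class)
print(train.classes)          # ['airplane', 'automobile', ..., 'truck']
img, label = train[0]
print(img.size, train.classes[label])  # (32, 32) and the class name
```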
The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists of 60,000 32×32 color images. The 100 classes in CIFAR-100 are grouped into 20 superclasses. There are 600 images per class. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). There are 500 training images and 100 testing images per class.
7,653 PAPERS • 52 BENCHMARKS
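The fine/coarse label pairing can be read directly from the Python version of the archive; a sketch follows (file paths assume the standard cifar-100-python layout, and note that torchvision's CIFAR100 class surfaces only the fine labels as targets):

```python
# Sketch: read both fine and coarse labels from the CIFAR-100 python archive.
import pickle

with open("cifar-100-python/train", "rb") as f:
    batch = pickle.load(f, encoding="bytes")
with open("cifar-100-python/meta", "rb") as f:
    meta = pickle.load(f, encoding="bytes")

fine = batch[b"fine_labels"]      # 50,000 ints in [0, 99]
coarse = batch[b"coarse_labels"]  # 50,000 ints in [0, 19]
fine_names = [n.decode() for n in meta[b"fine_label_names"]]
coarse_names = [n.decode() for n in meta[b"coarse_label_names"]]
print(fine_names[fine[0]], "->", coarse_names[coarse[0]])
```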
STL-10 is an image dataset derived from ImageNet and widely used to evaluate unsupervised feature learning and self-taught learning algorithms. Besides 100,000 unlabeled images, it contains 13,000 labeled images from 10 object classes (such as birds, cats, and trucks), of which 5,000 are partitioned for training and the remaining 8,000 for testing. All images are color images of 96×96 pixels.
958 PAPERS • 17 BENCHMARKS
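torchvision also ships an STL10 loader whose split argument maps onto the three subsets described above; a brief sketch (paths illustrative):

```python
# Sketch: the three STL-10 splits via torchvision.
from torchvision import datasets

train = datasets.STL10(root="./data", split="train", download=True)          # 5,000 labeled
test = datasets.STL10(root="./data", split="test", download=True)            # 8,000 labeled
unlabeled = datasets.STL10(root="./data", split="unlabeled", download=True)  # 100,000

img, label = unlabeled[0]
print(img.size, label)  # (96, 96) -1  (unlabeled images carry label -1)
```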
Tiny ImageNet contains 100,000 images across 200 classes (500 per class), downsized to 64×64 color images. Each class has 500 training images, 50 validation images, and 50 test images.
942 PAPERS • 8 BENCHMARKS
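Under the commonly distributed archive layout (tiny-imagenet-200/train/&lt;wnid&gt;/images/*.JPEG, an assumption worth verifying against your copy), the training split can be read with a generic ImageFolder; a sketch:

```python
# Sketch: Tiny ImageNet's train split read as an ImageFolder-style tree.
from torchvision import datasets

train = datasets.ImageFolder("tiny-imagenet-200/train")
print(len(train), len(train.classes))  # 100000 200  (500 images per class)
img, _ = train[0]
print(img.size)  # (64, 64)
```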
The 2D-3D-S dataset provides a variety of mutually registered modalities from 2D, 2.5D and 3D domains, with instance-level semantic and geometric annotations. It covers over 6,000 m² collected in 6 large-scale indoor areas that originate from 3 different buildings. It contains over 70,000 RGB images, along with the corresponding depths, surface normals, semantic annotations, global XYZ images (all in the form of both regular and 360° equirectangular images) as well as camera information. It also includes registered raw and semantically annotated 3D meshes and point clouds. The dataset enables development of joint and cross-modal learning models and potentially unsupervised approaches utilizing the regularities present in large-scale indoor spaces.
129 PAPERS • 8 BENCHMARKS
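As background for the 360° form mentioned above, the standard equirectangular mapping takes a pixel to a unit viewing direction; a small sketch of that geometry (the formulas are generic, not specific to 2D-3D-S):

```python
# Sketch: equirectangular pixel (u, v) -> unit viewing direction.
import numpy as np

def equirect_to_ray(u, v, width, height):
    lon = (u / width) * 2 * np.pi - np.pi   # longitude in [-pi, pi]
    lat = np.pi / 2 - (v / height) * np.pi  # latitude in [-pi/2, pi/2]
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

# The image center looks straight down the +x axis.
print(equirect_to_ray(2048, 1024, 4096, 2048))  # [1. 0. 0.]
```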
NSynth is a dataset of one-shot instrumental notes, containing 305,979 musical notes, each with a unique pitch, timbre, and envelope. The sounds were collected from 1,006 instruments from commercial sample libraries and are annotated based on their source (acoustic, electronic, or synthetic), instrument family, and sonic qualities. The instrument families used in the annotation are bass, brass, flute, guitar, keyboard, mallet, organ, reed, string, synth lead, and vocal. Four-second monophonic 16 kHz audio snippets (notes) were generated for the instruments.
121 PAPERS • 3 BENCHMARKS
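Four seconds at 16 kHz means each mono note is 64,000 samples. A sketch of reading one note and its annotations, assuming the standard nsynth-*.jsonwav archive layout and JSON keys (both are assumptions to verify against the release):

```python
# Sketch: inspect one NSynth note (4 s * 16 kHz = 64,000 samples).
import json
import wave

with open("nsynth-train/examples.json") as f:
    notes = json.load(f)
note_id, meta = next(iter(notes.items()))
print(meta["pitch"], meta["instrument_family_str"], meta["instrument_source_str"])

with wave.open(f"nsynth-train/audio/{note_id}.wav", "rb") as w:
    print(w.getframerate(), w.getnframes())  # 16000 64000
```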
AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. Each video clip has been exhaustively annotated by human annotators, and together the clips represent a rich variety of scenes, recording conditions, and expressions of human activity.
94 PAPERS • 7 BENCHMARKS
A synthetically rendered dataset built from a library of standard 3D objects; it tests the ability to recognize compositions of object movements that require long-term reasoning.
47 PAPERS • 3 BENCHMARKS
This dataset includes time-series data generated by accelerometer and gyroscope sensors (attitude, gravity, userAcceleration, and rotationRate). It was collected with an iPhone 6s kept in the participant's front pocket, using SensingKit, which collects information from the Core Motion framework on iOS devices. All data was collected at a 50 Hz sample rate. A total of 24 participants, spanning a range of genders, ages, weights, and heights, performed 6 activities in 15 trials under the same environment and conditions: downstairs, upstairs, walking, jogging, sitting, and standing.
29 PAPERS • NO BENCHMARKS YET
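A typical first step with 50 Hz streams like these is to cut them into fixed-length windows for activity classification; a minimal sketch (the window width, stride, and channel count are illustrative choices, not part of the dataset specification):

```python
# Sketch: segment a 50 Hz multichannel stream into overlapping windows.
import numpy as np

def sliding_windows(signal: np.ndarray, width: int = 128, stride: int = 64):
    """Yield windows; 128 samples at 50 Hz covers 2.56 s of motion."""
    for start in range(0, len(signal) - width + 1, stride):
        yield signal[start:start + width]

trial = np.random.randn(50 * 60, 12)  # one synthetic minute of 12-channel data
windows = np.stack(list(sliding_windows(trial)))
print(windows.shape)  # (45, 128, 12)
```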
A benchmark for semi-supervised object detection on COCO using 10% labeled data.
28 PAPERS • 2 BENCHMARKS
Contains 349 COVID-19 CT images from 216 patients and 463 non-COVID-19 CT images. The utility of this dataset is confirmed by a senior radiologist who has been diagnosing and treating COVID-19 patients since the outbreak of the pandemic.
27 PAPERS • NO BENCHMARKS YET
The Multi Vehicle Stereo Event Camera (MVSEC) dataset is a collection of data designed for the development of novel 3D perception algorithms for event-based cameras. Stereo event data is collected from car, motorbike, hexacopter, and handheld platforms, and fused with lidar, IMU, motion capture, and GPS to provide ground-truth pose and depth images.
25 PAPERS • 1 BENCHMARK
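A sketch of reading a raw event stream from one of the HDF5 recordings; the file name and group path follow the published layout but should be treated as assumptions:

```python
# Sketch: read events (x, y, timestamp, polarity) from an MVSEC HDF5 file.
import h5py

with h5py.File("indoor_flying1_data.hdf5", "r") as f:
    events = f["davis/left/events"][:10000]  # assumed columns: x, y, t, polarity
    x, y, t, p = events.T
    print(t[-1] - t[0], p.min(), p.max())  # time span and polarity range
```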
Gibson is an open-source perceptual and physics simulator for exploring active, real-world perception. The Gibson Environment is used for real-world perception learning.
21 PAPERS • NO BENCHMARKS YET
CREMA-D is an emotional multimodal actor dataset of 7,442 original clips from 91 actors. The clips feature 48 male and 43 female actors between the ages of 20 and 74, from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified).
20 PAPERS • 7 BENCHMARKS
Contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human-labeled frames, or about 38.5 hours of face tracks, and the corresponding audio.
19 PAPERS • 1 BENCHMARK
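A sketch of tallying the per-frame labels from a CSV release; the column order (video id, timestamp, box coordinates, label, entity id) is an assumption about the published format, not a documented guarantee:

```python
# Sketch: count per-frame active-speaker labels from an assumed CSV layout.
import csv
from collections import Counter

labels = Counter()
with open("ava_activespeaker_train.csv") as f:
    for video_id, ts, x1, y1, x2, y2, label, entity_id in csv.reader(f):
        labels[label] += 1
print(labels)
```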
MosMedData contains anonymised human lung computed tomography (CT) scans with COVID-19 related findings, as well as without such findings. A small subset of studies has been annotated with binary pixel masks depicting regions of interest (ground-glass opacifications and consolidations). CT scans were obtained between March 1, 2020 and April 25, 2020, and were provided by municipal hospitals in Moscow, Russia.
Consists of 20k English biomedical entity mentions from Reddit, expert-annotated with links to SNOMED CT, a widely used medical knowledge graph.
18 PAPERS • NO BENCHMARKS YET
TV show Caption (TVC) is a large-scale multimodal captioning dataset containing 261,490 caption descriptions paired with 108,965 short video moments. TVC is unique in that its captions may also describe dialogues/subtitles, while the captions in other datasets describe only the visual content.
15 PAPERS • 1 BENCHMARK
The Argoverse 2 Sensor Dataset is a collection of 1,000 scenarios with 3D object tracking annotations. Each sequence in our training and validation sets includes annotations for all objects within five meters of the "drivable area", the area in which it is possible for a vehicle to drive. The HD map for each scenario specifies the drivable area.
7 PAPERS • NO BENCHMARKS YET
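The "within five meters of the drivable area" rule above is a simple distance test against a map polygon; a toy sketch with shapely (illustrative geometry only, not the av2 API):

```python
# Sketch: keep objects within 5 m of a (toy) drivable-area polygon.
from shapely.geometry import Point, Polygon

drivable = Polygon([(0, 0), (40, 0), (40, 10), (0, 10)])
objects = [Point(5, 5), Point(5, 14), Point(5, 16)]
keep = [p for p in objects if drivable.distance(p) <= 5.0]
print(len(keep))  # 2 -- the third object is 6 m from the drivable area
```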
Source: BARThez: a Skilled Pretrained French Sequence-to-Sequence Model
7 PAPERS • 3 BENCHMARKS
YUD+ is a dataset containing additional Vanishing Point Labels for the York Urban Database.
6 PAPERS • NO BENCHMARKS YET
3DSeg-8 is a collection of several publicly available 3D segmentation datasets covering different medical imaging modalities, e.g. magnetic resonance imaging (MRI) and computed tomography (CT), with various scan regions, target organs, and pathologies.
5 PAPERS • NO BENCHMARKS YET
The ACAV100M pipeline processes 140 million full-length videos (total duration 1,030 years) to produce a dataset of 100 million 10-second clips (31 years) with high audio-visual correspondence. This is two orders of magnitude larger than the current largest video dataset used in the audio-visual learning literature, i.e., AudioSet (8 months), and twice as large as the largest video dataset in the literature, i.e., HowTo100M (15 years).
CLUECorpus2020 is a large-scale corpus that can be used directly for self-supervised learning, such as pre-training a language model, or for language generation. It contains 100 GB of raw text with 35 billion Chinese characters retrieved from Common Crawl.
StreetStyle is a large-scale dataset of photos of people annotated with clothing attributes; it was used to train attribute classifiers via deep learning.
DABS is a domain-agnostic benchmark for self-supervised learning to encourage research and progress towards domain-agnostic methods.
4 PAPERS • 1 BENCHMARK
The LIMUC dataset is the largest publicly available labeled ulcerative colitis dataset, comprising 11,276 images from 564 patients and 1,043 colonoscopy procedures. Three experienced gastroenterologists were involved in the annotation process, and all images are labeled according to the Mayo endoscopic score (MES).
NYU-VP is a new dataset for multi-model fitting, in this case vanishing point (VP) estimation. Each image is annotated with up to eight vanishing points, and pre-extracted line segments are provided that act as data points for a robust estimator (see the sketch below). Due to its size, the dataset is the first to allow supervised learning of a multi-model fitting task.
4 PAPERS • NO BENCHMARKS YET
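For NYU-VP, the connection between line segments and vanishing points is the usual projective one: in homogeneous coordinates, two image lines intersect at their cross product, and a robust estimator scores such candidate points against all segments. A toy sketch (numbers illustrative, not dataset values):

```python
# Sketch: candidate vanishing point as the cross product of two homogeneous lines.
import numpy as np

def line_through(p, q):
    return np.cross([*p, 1.0], [*q, 1.0])  # homogeneous line through p and q

l1 = line_through((0, 0), (100, 10))   # two segments that converge
l2 = line_through((0, 50), (100, 55))
vp = np.cross(l1, l2)                  # their intersection, homogeneous
print(vp[:2] / vp[2])                  # [1000. 100.] in pixel coordinates
```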
Wild-Time is a benchmark of 5 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including patient prognosis and news classification. On these datasets, we systematically benchmark 13 prior approaches, including methods in domain generalization, continual learning, self-supervised learning, and ensemble learning.
The Argoverse 2 Lidar Dataset is a collection of 20,000 scenarios with lidar sensor data, HD maps, and ego-vehicle pose. It does not include imagery or 3D annotations. The dataset is designed to support research into self-supervised learning in the lidar domain, as well as point cloud forecasting.
3 PAPERS • NO BENCHMARKS YET
Unsupervised Domain Adaptation demonstrates great potential to mitigate domain shifts by transferring models from labeled source domains to unlabeled target domains. While Unsupervised Domain Adaptation has been applied to a wide variety of complex vision tasks, only a few works focus on lane detection for autonomous driving, which can be attributed to the lack of publicly available datasets. To facilitate research in these directions, we propose CARLANE, a 3-way sim-to-real domain adaptation benchmark for 2D lane detection. CARLANE encompasses the single-target datasets MoLane and TuLane and the multi-target dataset MuLane. These datasets are built from three different domains, which cover diverse scenes and contain a total of 163K unique images, 118K of which are annotated. In addition, we evaluate and report systematic baselines, including our own method, which builds upon Prototypical Cross-domain Self-supervised Learning. We find that false positive and false negative rates of the evaluated domain adaptation methods are high compared to those of fully supervised baselines.
3 PAPERS • 3 BENCHMARKS
DCASE2014 is an audio classification benchmark.
SSL4EO-S12 is a large-scale, global, multimodal, and multi-seasonal corpus of satellite imagery from the ESA Sentinel-1 & -2 satellite missions.
The Extended Agriculture-Vision dataset comprises two parts.
2 PAPERS • NO BENCHMARKS YET
Experimental and synthetic (simulated) optoacoustic (OA) datasets of raw signals and reconstructed images, rendered with different experimental parameters and tomographic acquisition geometries.
A classification dataset of radar spectrograms in a "ground surveillance" setting, recorded with the Open Radar Initiative. The dataset has been collected with a stationary radar and targets moving in front of the radar, using both collaborative and non-collaborative targets.
The Unified SSL Benchmark (USB) consists of 15 diverse, challenging, and comprehensive tasks from computer vision (CV), natural language processing (NLP), and audio processing (Audio) for evaluating semi-supervised learning (SSL) methods. A modular and extensible codebase is open-sourced for fair evaluation of these SSL methods.
Contains a large number of online videos and subtitles.
1 PAPER • NO BENCHMARKS YET
The Argoverse 2 Map Change Dataset is a collection of 1,000 scenarios with ring camera imagery, lidar, and HD maps. Two hundred of the scenarios include changes in the real-world environment that are not yet reflected in the HD map, such as new crosswalks or repainted lanes. By sharing a map dataset that labels the instances in which there are discrepancies with sensor data, we encourage the development of novel methods for detecting out-of-date map regions.
The Sentinel-2 satellite carries 12 CMOS detectors for the VNIR bands, with adjacent detectors having overlapping fields of view, which results in overlapping regions in Level-1B (L1B) images. This dataset includes 3,740 pairs of overlapping image crops extracted from two L1B products. Each crop has a height of around 400 pixels and a variable width that depends on the overlap width between detectors for the RGBN bands, typically around 120-200 pixels. In addition to detector parallax, there is also cross-band parallax for each detector, resulting in shifts between bands. Pre-registration is performed for both cross-band and cross-detector parallax, with a precision of up to a few pixels (typically less than 10 pixels).
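Residual shifts of a few pixels, as described above, are exactly what phase correlation measures; a sketch with scikit-image on a synthetic crop pair (array sizes mimic the crops, and the shift is simulated, not taken from the dataset):

```python
# Sketch: estimate the residual shift between an overlapping crop pair.
import numpy as np
from skimage.registration import phase_cross_correlation

a = np.random.rand(400, 160)                 # stand-in for one crop
b = np.roll(a, shift=(3, -2), axis=(0, 1))   # simulate a small parallax
shift, error, _ = phase_cross_correlation(a, b)
print(shift)  # ~[-3.  2.]: the shift that registers b back onto a
```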
The scale of the data accessible through internet search engines can reach hundreds of millions, or even billions, of images. The existence of such large weakly labeled databases has gained importance in the training of face recognition algorithms. Starting from the publicly available YFCC100M, we propose a weakly labeled subset for multi-label face recognition with self-supervised methods. A 392K-image subset of YFCC100M, at 128×128 resolution, was obtained by querying for the 40 facial attributes. We made this dataset publicly available.
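Multi-label face recognition over 40 attributes reduces each image's annotation to a multi-hot vector; a minimal sketch (the attribute names and encoding are placeholders; only the count of 40 comes from the description above):

```python
# Sketch: encode a set of facial attributes as a 40-dim multi-hot vector.
import numpy as np

ATTRIBUTES = [f"attr_{i:02d}" for i in range(40)]  # placeholder names

def to_multi_hot(present: set) -> np.ndarray:
    vec = np.zeros(len(ATTRIBUTES), dtype=np.float32)
    for name in present:
        vec[ATTRIBUTES.index(name)] = 1.0
    return vec

y = to_multi_hot({"attr_03", "attr_21"})
print(int(y.sum()), y.shape)  # 2 (40,)
```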