Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5000 finely annotated images and 20000 coarsely annotated ones. Data was captured in 50 cities over several months, at different times of day, and in good weather conditions. It was originally recorded as video, so the frames were manually selected to have the following features: a large number of dynamic objects, varying scene layout, and varying background.
3,323 PAPERS • 54 BENCHMARKS
KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their needs. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 (140 for training and 112 for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images (from the visual odometry challenge).
3,219 PAPERS • 141 BENCHMARKS
ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled voxels rather than points or objects. To date, ScanNet v2, the newest version of ScanNet, has collected 1513 annotated scans with approximately 90% surface coverage. For the semantic segmentation task, the dataset is annotated with 20 classes of 3D voxelized objects.
1,240 PAPERS • 19 BENCHMARKS
The NYU-Depth V2 data set is comprised of video sequences from a variety of indoor scenes, recorded by both the RGB and depth cameras of the Microsoft Kinect.
841 PAPERS • 20 BENCHMARKS
The Matterport3D dataset is a large RGB-D dataset for scene understanding in indoor environments. It contains 10,800 panoramic views inside 90 real building-scale scenes, constructed from 194,400 RGB-D images. Each scene is a residential building consisting of multiple rooms and floor levels, and is annotated with surface reconstructions, camera poses, and semantic segmentations.
379 PAPERS • 5 BENCHMARKS
The Middlebury Stereo dataset consists of high-resolution stereo sequences with complex geometry and pixel-accurate ground-truth disparity data. The ground-truth disparities are acquired using a novel technique that employs structured lighting and does not require the calibration of the light projectors.
204 PAPERS • 5 BENCHMARKS
TUM RGB-D is an RGB-D dataset. It contains the color and depth images of a Microsoft Kinect sensor along the ground-truth trajectory of the sensor. The data was recorded at full frame rate (30 Hz) and sensor resolution (640x480). The ground-truth trajectory was obtained from a high-accuracy motion-capture system with eight high-speed tracking cameras (100 Hz).
189 PAPERS • NO BENCHMARKS YET
SUNCG is a large-scale dataset of synthetic 3D scenes with dense volumetric annotations.
181 PAPERS • NO BENCHMARKS YET
Taskonomy provides a large and high-quality dataset of varied indoor scenes.
132 PAPERS • 2 BENCHMARKS
The 2D-3D-S dataset provides a variety of mutually registered modalities from 2D, 2.5D and 3D domains, with instance-level semantic and geometric annotations. It covers over 6,000 m² collected in 6 large-scale indoor areas that originate from 3 different buildings. It contains over 70,000 RGB images, along with the corresponding depths, surface normals, semantic annotations, global XYZ images (all in the form of both regular and 360° equirectangular images) as well as camera information. It also includes registered raw and semantically annotated 3D meshes and point clouds. The dataset enables development of joint and cross-modal learning models and potentially unsupervised approaches utilizing the regularities present in large-scale indoor spaces.
129 PAPERS • 8 BENCHMARKS
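Since much of 2D-3D-S is delivered as 360° equirectangular panoramas, a common preprocessing step is mapping a panorama pixel to a viewing ray. Below is a minimal sketch of that mapping; the longitude/latitude axis convention is an illustrative assumption and should be checked against the dataset's own documentation.

```python
import numpy as np

def equirect_pixel_to_ray(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit ray direction.

    Assumes the common convention that u spans longitude in [-pi, pi]
    and v spans latitude in [pi/2, -pi/2]; verify against the dataset docs.
    """
    lon = (u / width - 0.5) * 2.0 * np.pi    # horizontal angle
    lat = (0.5 - v / height) * np.pi         # vertical angle
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.array([x, y, z])

# Example: the centre pixel of a 4096x2048 panorama looks straight ahead.
print(equirect_pixel_to_ray(2048, 1024, 4096, 2048))  # ~[0, 0, 1]
```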
The Make3D dataset is a monocular Depth Estimation dataset that contains 400 training pairs of RGB images and depth maps, and 134 test samples. The RGB images have high resolution, while the depth maps are provided at low resolution.
122 PAPERS • 1 BENCHMARK
Virtual KITTI is a photo-realistic synthetic video dataset designed to learn and evaluate computer vision models for several video understanding tasks: object detection and multi-object tracking, scene-level and instance-level semantic segmentation, optical flow, and depth estimation.
120 PAPERS • 1 BENCHMARK
The MegaDepth dataset is a dataset for single-view depth prediction that includes 196 different locations reconstructed from COLMAP SfM/MVS.
115 PAPERS • NO BENCHMARKS YET
SUN3D contains a large-scale RGB-D video database, with 8 annotated sequences. Each frame has a semantic segmentation of the objects in the scene and information about the camera pose. It is composed of 415 sequences captured in 254 different spaces, in 41 different buildings. Moreover, some places have been captured multiple times at different moments of the day.
114 PAPERS • NO BENCHMARKS YET
ETH3D is a multi-view stereo / 3D reconstruction benchmark that covers a variety of indoor and outdoor scenes. Ground-truth geometry has been obtained using a high-precision laser scanner. A DSLR camera as well as a synchronized multi-camera rig with varying fields of view were used to capture images.
79 PAPERS • 1 BENCHMARK
For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding. It contains 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.
59 PAPERS • 1 BENCHMARK
DIODE (Dense Indoor/Outdoor DEpth) is the first standard dataset for monocular depth estimation comprising diverse indoor and outdoor scenes acquired with the same hardware setup. The training set consists of 8574 indoor and 16884 outdoor samples from 20 scans each. The validation set contains 325 indoor and 446 outdoor samples, with each set drawn from 10 different scans. The ground-truth density for the indoor training and validation splits is approximately 99.54% and 99%, respectively. The density of the outdoor sets is naturally lower, at 67.19% for the training and 78.33% for the validation subset. The indoor and outdoor ranges for the dataset are 50m and 300m, respectively.
57 PAPERS • 2 BENCHMARKS
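The density figures quoted for DIODE are simply the fraction of pixels carrying valid ground-truth depth. A minimal sketch of computing that fraction is shown below, assuming a per-image depth array and validity mask stored as paired NumPy files; the file names are hypothetical placeholders.

```python
import numpy as np

# Hypothetical file names -- adjust to the actual layout of the DIODE download,
# which ships per-image depth arrays together with validity masks.
depth = np.squeeze(np.load("example_indoors_depth.npy"))
mask = np.load("example_indoors_depth_mask.npy").astype(bool)

# Ground-truth density = fraction of pixels with a valid depth value.
print(f"valid depth pixels: {mask.mean():.2%}")

# Depth statistics restricted to valid pixels (indoor range ~50 m,
# outdoor ~300 m per the dataset description).
valid_depth = depth[mask]
print(f"min {valid_depth.min():.2f} m, max {valid_depth.max():.2f} m")
```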
DDAD is a new autonomous driving benchmark from TRI (Toyota Research Institute) for long range (up to 250m) and dense depth estimation in challenging and diverse urban conditions. It contains monocular videos and accurate ground-truth depth (across a full 360 degree field of view) generated from high-density LiDARs mounted on a fleet of self-driving cars operating in a cross-continental setting. DDAD contains scenes from urban settings in the United States (San Francisco, Bay Area, Cambridge, Detroit, Ann Arbor) and Japan (Tokyo, Odaiba).
54 PAPERS • 1 BENCHMARK
The Middlebury 2014 dataset contains a set of 23 high resolution stereo pairs for which known camera calibration parameters and ground truth disparity maps obtained with a structured light scanner are available. The images in the Middlebury dataset all show static indoor scenes with varying difficulties including repetitive structures, occlusions, wiry objects as well as untextured areas.
51 PAPERS • 2 BENCHMARKS
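Given the calibration parameters shipped with each Middlebury 2014 scene, ground-truth disparity can be converted to metric depth with the standard pinhole relation Z = f·B/(d + doffs), where doffs is the difference between the two cameras' principal points. The sketch below uses placeholder calibration values for illustration; the real ones come from each scene's calib.txt.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_mm, doffs=0.0):
    """Convert a ground-truth disparity map (pixels) to depth (mm) via the
    pinhole relation Z = f * B / (d + doffs)."""
    depth = np.zeros_like(disparity, dtype=np.float64)
    valid = np.isfinite(disparity) & (disparity > 0)   # 0 / inf mark unknown disparity
    depth[valid] = focal_px * baseline_mm / (disparity[valid] + doffs)
    return depth, valid

# Placeholder calibration values for illustration only; read the real ones
# from the per-scene calib.txt shipped with the dataset.
disp = np.random.uniform(30.0, 250.0, size=(480, 640))
depth_mm, valid = disparity_to_depth(disp, focal_px=4000.0, baseline_mm=190.0, doffs=120.0)
```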
DrivingStereo contains over 180k images covering a diverse set of driving scenarios, which is hundreds of times larger than the KITTI Stereo dataset. High-quality labels of disparity are produced by a model-guided filtering strategy from multi-frame LiDAR points.
41 PAPERS • NO BENCHMARKS YET
DENSE (Depth Estimation oN Synthetic Events) is a new dataset with synthetic events and perfect ground truth.
36 PAPERS • 1 BENCHMARK
Virtual KITTI 2 is an updated version of the well-known Virtual KITTI dataset which consists of 5 sequence clones from the KITTI tracking benchmark. In addition, the dataset provides different variants of these sequences such as modified weather conditions (e.g. fog, rain) or modified camera configurations (e.g. rotated by 15°). For each sequence we provide multiple sets of images containing RGB, depth, class segmentation, instance segmentation, flow, and scene flow data. Camera parameters and poses as well as vehicle locations are available as well. In order to showcase some of the dataset’s capabilities, we ran multiple relevant experiments using state-of-the-art algorithms from the field of autonomous driving. The dataset is available for download at https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds.
33 PAPERS • 1 BENCHMARK
The MannequinChallenge Dataset (MQC) provides in-the-wild videos of people in static poses while a hand-held camera pans around the scene. The dataset consists of three splits for training, validation and testing.
26 PAPERS • NO BENCHMARKS YET
2D-3D Match Dataset is a new dataset of 2D-3D correspondences built by leveraging the availability of several 3D datasets from RGB-D scans. Specifically, the data from SceneNN and 3DMatch are used. The training dataset consists of 110 RGB-D scans, of which 56 scenes are from SceneNN and 54 scenes are from 3DMatch. The 2D-3D correspondence data is generated as follows. Given a 3D point randomly sampled from a 3D point cloud, a set of 3D patches from different scanning views is extracted. To find a 2D-3D correspondence, for each 3D patch, its 3D position is re-projected into all RGB-D frames in which the point lies in the camera frustum, taking occlusion into account. The corresponding local 2D patches around the re-projected point are extracted. In total, around 1.4 million 2D-3D correspondences are collected.
25 PAPERS • NO BENCHMARKS YET
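A rough sketch of the re-projection step described for 2D-3D Match is given below: a sampled 3D point is projected into an RGB-D frame using its pose and intrinsics, kept only if it falls inside the frustum and is not occluded according to the frame's depth map, and the surrounding 2D patch is cropped. All names, patch sizes, and tolerances are illustrative assumptions, not the authors' code.

```python
import numpy as np

def project_point(p_world, T_world_to_cam, K):
    """Project a 3D world point into a camera; return pixel (u, v) and depth z."""
    p_cam = T_world_to_cam[:3, :3] @ p_world + T_world_to_cam[:3, 3]
    z = p_cam[2]
    uv = (K @ p_cam)[:2] / z
    return uv, z

def corresponding_patch(p_world, T_world_to_cam, K, rgb, depth, half=32, occ_tol=0.05):
    """Re-project a sampled 3D point into one RGB-D frame and return the local
    2D patch around it, or None if the point is outside the frustum or occluded."""
    uv, z = project_point(p_world, T_world_to_cam, K)
    if z <= 0:
        return None                                   # behind the camera
    u, v = int(round(uv[0])), int(round(uv[1]))
    h, w = depth.shape
    if not (half <= u < w - half and half <= v < h - half):
        return None                                   # outside the image / too close to the border
    if abs(depth[v, u] - z) > occ_tol:
        return None                                   # occluded by a nearer surface
    return rgb[v - half:v + half, u - half:u + half]
```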
A dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images.
23 PAPERS • 2 BENCHMARKS
The ReDWeb dataset consists of 3600 RGB and relative depth (RGB-RD) image pairs collected from the Web. This dataset covers a wide range of scenes and features various non-rigid objects.
22 PAPERS • NO BENCHMARKS YET
The dataset was collected using the Intel RealSense D435i camera, which was configured to produce synchronized accelerometer and gyroscope measurements at 400 Hz, along with synchronized VGA-size (640 x 480) RGB and depth streams at 30 Hz. The depth frames are acquired using active stereo and are aligned to the RGB frames using the sensor's factory calibration. All the measurements are timestamped.
19 PAPERS • 1 BENCHMARK
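For reference, a minimal pyrealsense2 sketch of a comparable capture setup is shown below: VGA depth and colour streams at 30 Hz, with depth aligned to the colour frame via the factory calibration. This is not the authors' capture code, and the IMU streams described above are omitted here.

```python
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
# Synchronized VGA depth and colour streams at 30 Hz; IMU streams omitted in this sketch.
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

# Align depth to the colour frame using the factory calibration.
align = rs.align(rs.stream.color)
try:
    frames = pipeline.wait_for_frames()
    aligned = align.process(frames)
    depth = np.asanyarray(aligned.get_depth_frame().get_data())   # uint16, in depth units (typically 1 mm)
    color = np.asanyarray(aligned.get_color_frame().get_data())   # uint8 BGR
    print(depth.shape, color.shape, frames.get_timestamp())       # frames carry timestamps
finally:
    pipeline.stop()
```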
A collection of high-quality 360° datasets with ground-truth depth annotations, created by re-using recently released large-scale 3D datasets and re-purposing them to 360° via rendering.
14 PAPERS • NO BENCHMARKS YET
The KITTI-Depth dataset includes depth maps from projected LiDAR point clouds that were matched against the depth estimation from the stereo cameras. The depth images are highly sparse with only 5% of the pixels available and the rest is missing. The dataset has 86k training images, 7k validation images, and 1k test set images on the benchmark server with no access to the ground truth.
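In the KITTI depth-completion format, ground-truth depth maps are typically distributed as 16-bit PNGs where metric depth equals the raw value divided by 256 and zero marks a missing pixel. The sketch below loads such a map and measures its sparsity; the file path is a placeholder.

```python
import numpy as np
from PIL import Image

def load_kitti_depth(png_path):
    """Load a KITTI depth map stored as a 16-bit PNG.

    Following the depth-completion devkit convention, depth in metres is the
    raw 16-bit value divided by 256, and a value of 0 marks a pixel with no
    ground truth.
    """
    raw = np.array(Image.open(png_path), dtype=np.uint16)
    depth = raw.astype(np.float32) / 256.0
    valid = raw > 0
    return depth, valid

# Example: quantify the sparsity mentioned above (~5% valid pixels).
# depth, valid = load_kitti_depth("path/to/groundtruth_depth.png")
# print(f"valid pixels: {valid.mean():.1%}")
```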
The Web Stereo Video Dataset consists of 553 stereoscopic videos from YouTube. This dataset has a wide variety of scene types, and features many nonrigid objects.
12 PAPERS • NO BENCHMARKS YET
The HRWSI dataset consists of about 21K diverse high-resolution RGB-D image pairs derived from the Web stereo images. Also, it provides sky segmentation masks, instance segmentation masks as well as invalid pixel masks.
10 PAPERS • NO BENCHMARKS YET
Depth in the Wild is a dataset for single-image depth perception in the wild, i.e., recovering depth from a single image taken in unconstrained settings. It consists of images in the wild annotated with relative depth between pairs of random points.
9 PAPERS • NO BENCHMARKS YET
An in-the-wild stereo image dataset, comprising 49,368 image pairs contributed by users of the Holopix mobile social platform.
HUMAN4D is a large and multimodal 4D dataset that contains a variety of human activities simultaneously captured by a professional marker-based MoCap, a volumetric capture and an audio recording system. By capturing 2 female and 2 male professional actors performing various full-body movements and expressions, HUMAN4D provides a diverse set of motions and poses encountered as part of single- and multi-person daily, physical and social activities (jumping, dancing, etc.), along with multi-RGBD (mRGBD), volumetric and audio data.
8 PAPERS • NO BENCHMARKS YET
The Stanford Light Field Archive is a collection of several light fields for research in computer graphics and vision.
7 PAPERS • NO BENCHMARKS YET
DurLAR is a high-fidelity 128-channel 3D LiDAR dataset with panoramic ambient (near infrared) and reflectivity imagery for multi-modal autonomous driving applications, offering several novel features compared to existing autonomous driving task datasets.
5 PAPERS • NO BENCHMARKS YET
The Middlebury 2006 is a stereo dataset of indoor scenes with multiple handcrafted layouts.
SYNS-Patches is a subset of the SYNS dataset. The original SYNS is composed of aligned image and LiDAR panoramas from 92 different scenes belonging to a wide variety of environments, such as Agriculture, Natural (e.g. forests and fields), Residential, Industrial and Indoor. SYNS-Patches contains the patches from each scene extracted at eye level at 20° intervals of a full horizontal rotation, i.e. 18 images per scene (360°/20°) and a total dataset size of 1656 images (92 × 18).
CocoDoom is a collection of pre-recorded data extracted from Doom gaming sessions along with annotations in the MS COCO format.
4 PAPERS • NO BENCHMARKS YET
The DCM dataset is composed of 772 annotated images from 27 golden age comic books. We collected them from the public-domain collection of digitized comic books at the Digital Comics Museum. One album per available publisher was selected to get as many different styles as possible. We made ground-truth bounding boxes for all panels and all characters (body + face), small or big, human-like or animal-like.
4 PAPERS • 3 BENCHMARKS
Endoscopic stereo reconstruction for surgical scenes gives rise to specific problems, including the lack of clear corner features, highly specular surface properties, and the presence of blood and smoke. These issues present difficulties both for stereo reconstruction itself and for standardised dataset production. We present a stereo-endoscopic reconstruction validation dataset based on cone-beam CT (SERV-CT). Two ex vivo small porcine full-torso cadavers were placed within the view of the endoscope with both the endoscope and target anatomy visible in the CT scan. The orientation of the endoscope was then manually aligned to match the stereoscopic view, and benchmark disparities, depths and occlusions were calculated. The requirement of a CT scan limited the number of stereo pairs to 8 from each ex vivo sample. For the second sample an RGB surface was acquired to aid alignment of smooth, featureless surfaces. Repeated manual alignments showed an RMS disparity accuracy of around
A large-scale synthetic dataset containing accurate ground-truth depth for various photo-realistic scenes.
3 PAPERS • NO BENCHMARKS YET
Dynamic Replica is a synthetic dataset of stereo videos featuring humans and animals in virtual environments. It is a benchmark for dynamic disparity/depth estimation and 3D reconstruction consisting of 145,200 stereo frames (524 videos).
The endoscopic SLAM dataset (EndoSLAM) is a dataset for depth estimation in endoscopic videos. It consists of both ex-vivo and synthetically generated data. The ex-vivo part of the dataset includes standard as well as capsule endoscopy recordings. The dataset is divided into 35 sub-datasets: 18 for the colon, 5 for the small intestine and 12 for the stomach.
We present a new large-scale photorealistic panoramic dataset named FutureHouse, which has the following characteristics. 1) It contains over 70,000 high-quality models with high-resolution meshes and physical materials. All models are measured in real-world standards. 2) The selected scene layouts are carefully designed by over 100 excellent artists. All of the selected layouts are used in real-world display. 3) It contains 28,579 good panoramic views from 1,752 house-scale scenes. Therefore, it can be used for perspective image tasks as well as omnidirectional image tasks. 4) Richer physical material representation. Most materials are represented by a microfacet BRDF modeling metalness, and the rest are represented by special shading models, e.g., cloth and transmission materials. 5) High rendering quality. Benefiting from a commercial rendering engine, Unreal Engine 4, and powerful deep learning super sampling (DLSS), our renderings have less noise. Our SVBRDF rep
We present a large-scale dataset for 3D urban scene understanding. Compared to existing datasets, our dataset consists of 75 outdoor urban scenes with diverse backgrounds, encompassing over 15,000 images. These scenes offer 360° hemispherical views, capturing diverse foreground objects illuminated under various lighting conditions. Additionally, our dataset encompasses scenes that are not limited to forward-driving views, addressing the limitations of previous datasets such as limited overlap and coverage between camera views. The closest pre-existing dataset for generalizable evaluation is DTU [2] (80 scenes), which comprises mostly indoor objects and does not provide multiple foreground objects or background scenes.
3 PAPERS • 1 BENCHMARK
A new cross-season, scale-less monocular depth prediction dataset derived from the CMU Visual Localization dataset through structure from motion.
UASOL is an RGB-D stereo dataset that contains 160,902 frames, filmed at 33 different scenes, each with between 2k and 10k frames. The frames show different paths from the perspective of a pedestrian, including sidewalks, trails, roads, etc. The images were extracted from video files recorded at 15 fps at HD2K resolution (2280 × 1282 pixels). The dataset also provides a GPS geolocalization tag for each second of the sequences and reflects different climatological conditions. Up to 4 different people filmed the dataset at different moments of the day.