KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation; however, various researchers have manually annotated parts of the dataset to fit their needs. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 acquisitions (140 for training and 112 for testing), consisting of RGB images and Velodyne scans, from the tracking challenge with ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images from the visual odometry challenge.
3,219 PAPERS • 141 BENCHMARKS
FlyingThings3D is a synthetic dataset for optical flow, disparity, and scene flow estimation. It consists of everyday objects flying along randomized 3D trajectories, with about 25,000 stereo frames and ground truth data. Instead of focusing on a particular task (like KITTI) or enforcing strict naturalism (like Sintel), the authors rely on randomness and a large pool of rendering assets to generate orders of magnitude more data than existing alternatives, without the risk of repetition or saturation.
215 PAPERS • NO BENCHMARKS YET
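FlyingThings3D's dense ground truth (disparity and flow) is commonly distributed as PFM files; that storage format is an assumption about the release in use, so adapt as needed. A minimal PFM reader sketch in Python:

import re
import numpy as np

def read_pfm(path):
    # Read a PFM file into a float32 array of shape (H, W) or (H, W, 3).
    with open(path, "rb") as f:
        header = f.readline().decode("ascii").rstrip()
        if header not in ("PF", "Pf"):
            raise ValueError("Not a PFM file")
        channels = 3 if header == "PF" else 1
        width, height = map(int, re.findall(r"\d+", f.readline().decode("ascii")))
        scale = float(f.readline().decode("ascii").rstrip())
        endian = "<" if scale < 0 else ">"  # negative scale means little-endian
        data = np.frombuffer(f.read(), dtype=endian + "f4", count=width * height * channels)
    shape = (height, width, channels) if channels == 3 else (height, width)
    return np.flipud(data.reshape(shape))  # PFM stores rows bottom-to-top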
Vimeo-90K is a large-scale, high-quality video dataset for low-level video processing. It covers three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution.
192 PAPERS • 3 BENCHMARKS
MPI (Max Planck Institute) Sintel is a dataset for optical flow evaluation that also provides 1064 synthesized stereo images with ground-truth disparity data. Sintel is derived from the open-source 3D animated short film Sintel. The dataset has 23 different scenes. The stereo images are RGB while the disparity maps are grayscale. Both have a resolution of 1024×436 pixels with 8 bits per channel.
187 PAPERS • 6 BENCHMARKS
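The standard metric for optical flow evaluation on Sintel and similar benchmarks is the average endpoint error (EPE). A minimal sketch, assuming flow fields are given as (H, W, 2) arrays of per-pixel (u, v) displacements:

import numpy as np

def average_epe(flow_pred, flow_gt, valid_mask=None):
    # Per-pixel Euclidean distance between predicted and ground-truth flow vectors.
    epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    if valid_mask is not None:       # optionally restrict to valid/unoccluded pixels
        epe = epe[valid_mask]
    return float(epe.mean())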
Virtual KITTI is a photo-realistic synthetic video dataset designed to learn and evaluate computer vision models for several video understanding tasks: object detection and multi-object tracking, scene-level and instance-level semantic segmentation, optical flow, and depth estimation.
120 PAPERS • 1 BENCHMARK
MegaDepth is a dataset for single-view depth prediction that includes 196 different locations reconstructed via COLMAP SfM/MVS.
115 PAPERS • NO BENCHMARKS YET
VisDrone is a large-scale benchmark with carefully annotated ground truth for various important computer vision tasks, designed to bring computer vision to drones. The VisDrone2019 dataset was collected by the AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University, China. The benchmark consists of 288 video clips comprising 261,908 frames and 10,209 static images, captured by various drone-mounted cameras and covering a wide range of aspects, including location (14 different cities separated by thousands of kilometers in China), environment (urban and rural), objects (pedestrians, vehicles, bicycles, etc.), and density (sparse and crowded scenes). Note that the data were collected using various drone platforms (i.e., drones of different models), in different scenarios, and under various weather and lighting conditions. The frames are manually annotated with more than 2.6 million bounding boxes of frequently occurring targets, such as pedestrians, cars, bicycles, and tricycles.
61 PAPERS • 2 BENCHMARKS
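VisDrone detection annotations are usually distributed as one comma-separated text file per image, one object per line; the field order assumed below (bbox_left, bbox_top, bbox_width, bbox_height, score, category, truncation, occlusion) should be checked against the official devkit. A minimal parsing sketch:

def parse_visdrone_annotations(path):
    # Parse one annotation file into a list of dicts; field order is assumed, not guaranteed.
    boxes = []
    with open(path) as f:
        for line in f:
            parts = line.strip().rstrip(",").split(",")
            if len(parts) < 8:
                continue
            left, top, w, h, score, category, truncation, occlusion = map(int, parts[:8])
            boxes.append({
                "bbox": (left, top, w, h),  # pixels, (x, y, width, height)
                "score": score,
                "category": category,
                "truncation": truncation,
                "occlusion": occlusion,
            })
    return boxes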
The Event-Camera Dataset is a collection of datasets with an event-based camera for high-speed robotics. The data also include intensity images, inertial measurements, and ground truth from a motion-capture system. An event-based camera is a revolutionary vision sensor with three key advantages: a measurement rate that is almost 1 million times faster than standard cameras, a latency of 1 microsecond, and a high dynamic range of 130 decibels (standard cameras only have 60 dB). These properties enable the design of a new class of algorithms for high-speed robotics, where standard cameras suffer from motion blur and high latency. All the data are released both as text files and binary (i.e., rosbag) files.
44 PAPERS • 2 BENCHMARKS
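The text export of the Event-Camera Dataset stores one event per line; the column order assumed here (timestamp in seconds, x, y, polarity) matches the common events.txt layout but should be verified against the released files. A minimal loading sketch:

import numpy as np

def load_events(path):
    # Load events from a whitespace-separated text file into separate arrays.
    data = np.loadtxt(path)            # assumed columns: t [s], x [px], y [px], polarity {0, 1}
    t = data[:, 0]
    x = data[:, 1].astype(np.int32)
    y = data[:, 2].astype(np.int32)
    p = data[:, 3].astype(np.int8)
    return t, x, y, p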
A large real-world event-based dataset for object classification.
40 PAPERS • 2 BENCHMARKS
The Multi Vehicle Stereo Event Camera (MVSEC) dataset is a collection of data designed for the development of novel 3D perception algorithms for event based cameras. Stereo event data is collected from car, motorbike, hexacopter and handheld data, and fused with lidar, IMU, motion capture and GPS to provide ground truth pose and depth images.
25 PAPERS • 1 BENCHMARK
Spring is a large, high-resolution and high-detail, computer-generated benchmark for scene flow, optical flow, and stereo. Based on rendered scenes from the open-source Blender movie "Spring", it provides photo-realistic HD datasets with state-of-the-art visual effects and ground truth training data.
20 PAPERS • 3 BENCHMARKS
Provides a wide range of raw sensor data that is accessible on almost any modern-day smartphone together with a high-quality ground-truth track.
13 PAPERS • NO BENCHMARKS YET
SynWoodScape is a synthetic version of the WoodScape surround-view dataset, addressing many of its weaknesses and extending it. WoodScape comprises four surround-view cameras and nine tasks, including segmentation, depth estimation, 3D bounding box detection, and a novel soiling detection task. Semantic annotation of 40 classes at the instance level is provided for over 10,000 images. With WoodScape, the authors aim to encourage the community to adapt computer vision models to fisheye cameras instead of relying on naive rectification.
12 PAPERS • NO BENCHMARKS YET
A new and challenging video database of dynamic scenes that more than doubles the size of those previously available. This dataset is explicitly split into two subsets of equal size that contain videos with and without camera motion to allow for systematic study of how this variable interacts with the defining dynamics of the scene per se.
12 PAPERS • 1 BENCHMARK
SlowFlow is an optical flow dataset created by applying the Slow Flow technique to footage from a high-speed camera; it is used to analyze the performance of state-of-the-art optical flow methods under various levels of motion blur.
10 PAPERS • NO BENCHMARKS YET
The Blackbird unmanned aerial vehicle (UAV) dataset is a large-scale, aggressive indoor flight dataset collected with a custom-built quadrotor platform for evaluating agile perception. It contains over 10 hours of flight data from 168 flights over 17 flight trajectories and 5 environments. Each flight includes sensor data from 120 Hz stereo and downward-facing photorealistic virtual cameras, a 100 Hz IMU, motor speed sensors, and 360 Hz millimeter-accurate motion capture ground truth. Camera images for each flight were photorealistically rendered using FlightGoggles across a variety of environments to facilitate experimentation with high-performance perception algorithms.
9 PAPERS • NO BENCHMARKS YET
DSEC is a stereo camera dataset in driving scenarios that contains data from two monochrome event cameras and two global shutter color cameras in favorable and challenging illumination conditions. In addition, we collect Lidar data and RTK GPS measurements, both hardware synchronized with all camera data. One of the distinctive features of this dataset is the inclusion of VGA-resolution event cameras. Event cameras have received increasing attention for their high temporal resolution and high dynamic range performance. However, due to their novelty, event camera datasets in driving scenarios are rare. This work presents the first high-resolution, large-scale stereo dataset with event cameras.
9 PAPERS • 1 BENCHMARK
TUM-GAID (TUM Gait from Audio, Image and Depth) contains 305 subjects performing two walking trajectories in an indoor environment. The first trajectory is traversed from left to right and the second from right to left. Two recording sessions were performed: one in January, where subjects wore heavy jackets and mostly winter boots, and one in April, where subjects wore lighter clothes. The action is captured by a Microsoft Kinect sensor, which provides a video stream with a resolution of 640×480 pixels at a frame rate of around 30 fps.
The SAMM Long Videos dataset consists of 147 long videos with 343 macro-expressions and 159 micro-expressions. The dataset is FACS-coded with detailed Action Units.
8 PAPERS • 1 BENCHMARK
A large-scale synthetic dataset with 2.5 million photo-realistic images of 80 subjects performing 70 activities and wearing diverse outfits.
6 PAPERS • NO BENCHMARKS YET
OmniFlow is a synthetic omnidirectional human optical flow dataset. Using a rendering engine, the authors created a naturalistic 3D indoor environment with textured rooms, characters, actions, objects, illumination, and motion blur, where all components of the environment are shuffled during data capture. The simulation outputs rendered images of household activities along with the corresponding forward and backward optical flow. The dataset consists of 23,653 image pairs with forward and backward optical flow.
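Because OmniFlow provides both forward and backward flow, one common use is a forward-backward consistency check to flag occluded pixels. A rough sketch using nearest-neighbour lookup (bilinear sampling would be more accurate), assuming (H, W, 2) flow arrays with (x, y) displacement order:

import numpy as np

def occlusion_mask(flow_fw, flow_bw, tol=1.0):
    # True where the forward flow is not (approximately) undone by the backward flow.
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Follow the forward flow to the (rounded, clipped) target pixel in the next frame.
    xt = np.clip(np.rint(xs + flow_fw[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.rint(ys + flow_fw[..., 1]).astype(int), 0, h - 1)
    # Round trip: forward flow plus backward flow sampled at the target location.
    round_trip = flow_fw + flow_bw[yt, xt]
    return np.linalg.norm(round_trip, axis=-1) > tol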
TUB CrowdFlow is a synthetic dataset that contains 10 sequences showing 5 scenes. Each scene is rendered twice: once with a static point of view and once with a dynamic camera to simulate drone/UAV-based surveillance. The scenes are rendered using Unreal Engine at HD resolution (1280x720) and 25 fps, which is typical for current commercial CCTV surveillance systems. The total number of frames is 3200.
4 PAPERS • NO BENCHMARKS YET
Optical Flow in challenging scenes with gyroscope readings!
The QUVA Repetition dataset consists of 100 videos displaying a wide variety of repetitive video dynamics, including swimming, stirring, cutting, combing, and music-making. All videos have been annotated with individual cycle bounds and a total repetition count.
A large collection of interaction-rich video data that is annotated and analyzed.
RobotPush is a dataset for object singulation, the task of separating cluttered objects through physical interaction. The dataset contains 3456 training images and 1024 validation images, both with labels. It consists of simulated and real-world data collected from a PR2 robot equipped with a Kinect 2 camera. The dataset also contains ground truth instance segmentation masks for 110 images in the test set.
3 PAPERS • NO BENCHMARKS YET
An open-source Multi-View Overhead Imagery dataset with 27 unique looks from a broad range of viewing angles (-32.5 to 54.0 degrees). Each of these images covers the same 665 square km geographic extent, and all are annotated with 126,747 building footprint labels, enabling direct assessment of the impact of viewpoint perturbation on model performance.
Includes 3000 animated sequences rendered using styles randomly selected from 40 textured line styles and 38 shading styles, spanning the range between flat cartoon fill and wildly sketchy shading. The dataset includes 124K+ train set frames and 10K test set frames rendered at 1500x1500 resolution, far surpassing the largest available optical flow datasets in size.
2 PAPERS • NO BENCHMARKS YET
EDEN (Enclosed garDEN) is a multimodal synthetic dataset for nature-oriented applications. The dataset features more than 300K images captured from more than 100 garden models. Each image is annotated with various low- and high-level vision modalities, including semantic segmentation, depth, surface normals, intrinsic colors, and optical flow.
An autonomous driving dataset and benchmark for optical flow. This dataset was created by the Heidelberg Collaboratory for Image Processing in close cooperation with Robert Bosch GmbH.
Two datasets (synthetic and natural/real) containing simultaneously recorded egocentric and exocentric videos.
CloudCast is a satellite-based dataset consisting of 70,080 images with 10 different cloud types for multiple layers of the atmosphere, annotated at the pixel level. The spatial resolution is 928 × 1530 pixels (3 × 3 km per pixel), with 15-minute intervals between frames, for the period January 1, 2017 to December 31, 2018. All frames are centered and projected over Europe.
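As a consistency check on the stated frame count: 15-minute sampling gives 96 frames per day, and 96 frames/day × 730 days (2017 and 2018) = 70,080 frames, matching the number of images above.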
1 PAPER • NO BENCHMARKS YET
EKubric is built with the Kubric and ESIM simulators and contains 15,367 RGB-point cloud-event pairs with annotations, including ground truth for optical flow, scene flow, surface normals, semantic segmentation, and object coordinates.