KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their needs. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 (140 for training and 112 for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images (from the visual odometry challenge).
3,219 PAPERS • 141 BENCHMARKS
The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in Boston and Singapore. Each scene is 20 seconds long and annotated at 2 Hz, resulting in a total of 28,130 training samples, 6,019 validation samples, and 6,008 test samples. The dataset has the full autonomous vehicle data suite: a 32-beam LiDAR, 6 cameras, and 5 radars with complete 360° coverage. The 3D object detection challenge evaluates performance on 10 classes: cars, trucks, buses, trailers, construction vehicles, pedestrians, motorcycles, bicycles, traffic cones, and barriers.
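As a quick sanity check, the per-split sample counts quoted above are consistent with roughly 40 annotated keyframes per scene (20 s at 2 Hz) across the ~1000 scenes. A minimal sketch; the numbers come from the text, while the breakdown logic is an illustration, not the official devkit API:

```python
# Sanity check on the quoted nuScenes figures.
scene_length_s = 20       # each scene is 20 seconds long
annotation_rate_hz = 2    # annotated at 2 Hz
samples_per_scene = scene_length_s * annotation_rate_hz  # 40 keyframes

splits = {"train": 28_130, "val": 6_019, "test": 6_008}
total_samples = sum(splits.values())        # 40,157 samples overall
print(total_samples / samples_per_scene)    # ~1004 scene-equivalents, i.e. ~1000 scenes
```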
1,549 PAPERS • 20 BENCHMARKS
ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled voxels rather than points or objects. ScanNet v2, the newest version, contains 1,513 annotated scans with approximately 90% surface coverage. For the semantic segmentation task, the dataset is annotated with 20 classes of 3D voxelized objects.
1,240 PAPERS • 19 BENCHMARKS
The NYU-Depth V2 dataset comprises video sequences from a variety of indoor scenes, recorded by both the RGB and depth cameras of the Microsoft Kinect.
841 PAPERS • 20 BENCHMARKS
The SUN RGB-D dataset contains 10,335 real RGB-D images of room scenes. Each RGB image has a corresponding depth and segmentation map. As many as 700 object categories are labeled. The training and testing sets contain 5,285 and 5,050 images, respectively.
422 PAPERS • 13 BENCHMARKS
The Stanford 3D Indoor Scene Dataset (S3DIS) contains 6 large-scale indoor areas with 271 rooms. Each point in the scene point cloud is annotated with one of 13 semantic categories.
421 PAPERS • 10 BENCHMARKS
The Waymo Open Dataset comprises high-resolution sensor data collected by autonomous vehicles operated by the Waymo Driver in a wide variety of conditions.
373 PAPERS • 12 BENCHMARKS
Argoverse is a tracking benchmark with over 30K scenarios collected in Pittsburgh and Miami. Each scenario is a sequence of frames sampled at 10 Hz. Each sequence contains one object of interest, the “agent”, and the task is to predict the agent's future locations over a 3-second horizon. The sequences are split into training, validation, and test sets with 205,942, 39,472, and 78,143 sequences, respectively. These splits have no geographical overlap.
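To make the frame arithmetic concrete, here is a minimal sketch. The 10 Hz sample rate and 3-second horizon come from the text; the 2-second observation window is an assumption for illustration, not stated above:

```python
import numpy as np

SAMPLE_RATE_HZ = 10   # frames sampled at 10 Hz (from the text)
HORIZON_S = 3         # 3-second prediction horizon (from the text)
OBS_S = 2             # assumed observation window, for illustration only

obs_len = OBS_S * SAMPLE_RATE_HZ       # 20 observed frames
pred_len = HORIZON_S * SAMPLE_RATE_HZ  # 30 frames to predict

# A toy trajectory: (obs_len + pred_len, 2) array of agent (x, y) positions.
# A forecaster consumes the first obs_len rows and predicts the remainder.
trajectory = np.zeros((obs_len + pred_len, 2))
observed, future_gt = trajectory[:obs_len], trajectory[obs_len:]
print(observed.shape, future_gt.shape)  # (20, 2) (30, 2)
```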
322 PAPERS • 6 BENCHMARKS
Argoverse 2 (AV2) is a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras and two stereo cameras, in addition to lidar point clouds and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 26 object categories, all of which are sufficiently sampled to support training and evaluation of 3D perception models. The Lidar Dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. This dataset is the largest ever collection of lidar sensor data and supports self-supervised learning and the emerging task of point cloud forecasting. Finally, the Motion Forecasting Dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene. Models are tasked with the prediction of future motion.
108 PAPERS • 3 BENCHMARKS
ONCE (One millioN sCenEs) is a dataset for 3D object detection in the autonomous driving scenario. The ONCE dataset consists of 1 million LiDAR scenes and 7 million corresponding camera images. The data is selected from 144 driving hours, 20x longer than other available 3D autonomous driving datasets such as nuScenes and Waymo, and is collected across a range of different areas, time periods, and weather conditions.
66 PAPERS • NO BENCHMARKS YET
For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding. It contains 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.
59 PAPERS • 1 BENCHMARK
OPV2V is a large-scale open simulated dataset for Vehicle-to-Vehicle perception. It contains over 70 interesting scenes, 11,464 frames, and 232,913 annotated 3D vehicle bounding boxes, collected from 8 towns in CARLA and a digital town of Culver City, Los Angeles.
54 PAPERS • 2 BENCHMARKS
The Objectron dataset is a collection of short, object-centric video clips, which are accompanied by AR session metadata that includes camera poses, sparse point clouds, and characterization of the planar surfaces in the surrounding environment. In each video, the camera moves around the object, capturing it from different angles. The data also contain manually annotated 3D bounding boxes for each object, which describe the object’s position, orientation, and dimensions. The dataset consists of 15K annotated video clips supplemented with over 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes. To ensure geo-diversity, the dataset is collected from 10 countries across five continents.
38 PAPERS • NO BENCHMARKS YET
PandaSet is a dataset produced by a complete, high-precision autonomous vehicle sensor kit with a no-cost commercial license. The dataset was collected using one 360° mechanical spinning LiDAR, one forward-facing long-range LiDAR, and 6 cameras. The dataset contains more than 100 scenes, each of which is 8 seconds long, and provides 28 types of labels for object classification and 37 types of annotations for semantic segmentation.
37 PAPERS • NO BENCHMARKS YET
The A*3D dataset is a step forward in making autonomous driving safer for pedestrians and the public in the real world. Characteristics:
* 230K human-labeled 3D object annotations in 39,179 LiDAR point cloud frames and corresponding frontal-facing RGB images.
* Captured at different times (day, night) and in different weather conditions (sun, cloud, rain).
34 PAPERS • NO BENCHMARKS YET
H3D is a large-scale full-surround 3D multi-object detection and tracking dataset. It is gathered from the HDD dataset, a large-scale naturalistic driving dataset collected in the San Francisco Bay Area.
33 PAPERS • NO BENCHMARKS YET
DAIR-V2X is a large-scale, multi-modality, multi-view dataset from real scenarios for Vehicle-Infrastructure Cooperative Autonomous Driving (VICAD). DAIR-V2X comprises 71,254 LiDAR frames and 71,254 camera frames, all captured from real scenes with 3D annotations.
32 PAPERS • 2 BENCHMARKS
We introduce an object detection dataset in challenging adverse weather conditions, covering 12,000 samples in real-world driving scenes and 1,500 samples in controlled weather conditions within a fog chamber. The dataset includes different weather conditions like fog, snow, and rain and was acquired over 10,000 km of driving in northern Europe. In total, 100k objects were labeled with accurate 2D and 3D bounding boxes. The main contributions of this dataset are:
- We provide a proving ground for a broad range of algorithms covering signal enhancement, domain adaptation, object detection, or multi-modal sensor fusion, focusing on the learning of robust redundancies between sensors, especially if they fail asymmetrically in different weather conditions.
- The dataset was created with the initial intention to showcase methods that learn robust redundancies between sensors and enable raw-data sensor fusion in case the sensors fail asymmetrically in different weather conditions.
30 PAPERS • 2 BENCHMARKS
🤖 Robo3D - The nuScenes-C Benchmark nuScenes-C is an evaluation benchmark heading toward robust and reliable 3D perception in autonomous driving. With it, we probe the robustness of 3D detectors and segmentors under out-of-distribution (OoD) scenarios, against natural corruptions that occur in real-world environments.
26 PAPERS • 3 BENCHMARKS
A large-scale open V2X perception dataset generated using CARLA and OpenCDA.
24 PAPERS • 1 BENCHMARK
Collected with the Autonomoose autonomous vehicle platform, based on a modified Lincoln MKZ.
19 PAPERS • NO BENCHMARKS YET
🤖 Robo3D - The KITTI-C Benchmark KITTI-C is an evaluation benchmark heading toward robust and reliable 3D object detection in autonomous driving. With it, we probe the robustness of 3D detectors under out-of-distribution (OoD) scenarios, against natural corruptions that occur in real-world environments.
18 PAPERS • 2 BENCHMARKS
15 PAPERS • 2 BENCHMARKS
V2X-Sim, short for vehicle-to-everything simulation, is the first synthetic collaborative perception dataset in autonomous driving, developed by the AI4CE Lab at NYU and the MediaBrain Group at SJTU to facilitate collaborative perception between multiple vehicles and roadside infrastructure. Data is collected from both roadside units and vehicles when they are near the same intersection. With information from both the roadside infrastructure and the vehicles, the dataset aims to encourage research on collaborative perception tasks.
15 PAPERS • 1 BENCHMARK
The dataset consists of over 50,000 frames and includes high-definition images with full-resolution depth information, semantic segmentation (images), point-wise segmentation (point clouds), and detailed annotations for all vehicles and people.
13 PAPERS • NO BENCHMARKS YET
V2V4Real is a large-scale real-world multimodal dataset for V2V perception. The data is collected by two vehicles equipped with multimodal sensors driving together through diverse scenarios. It covers a driving area of 410 km and comprises 20K LiDAR frames, 40K RGB frames, 240K annotated 3D bounding boxes for 5 classes, and HD maps covering all driving routes.
KAIST-Radar (K-Radar) is a novel large-scale object detection dataset and benchmark that contains 35K frames of 4D Radar tensor (4DRT) data with power measurements along the Doppler, range, azimuth, and elevation dimensions, together with carefully annotated 3D bounding box labels of objects on the roads. K-Radar includes challenging driving conditions such as adverse weather (fog, rain, and snow) on various road structures (urban, suburban roads, alleyways, and highways). In addition to the 4DRT, we provide auxiliary measurements from carefully calibrated high-resolution LiDARs, surround stereo cameras, and RTK-GPS.
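For readers unfamiliar with 4DRT data, the sketch below illustrates how such a tensor might be laid out and reduced. The axis sizes are hypothetical placeholders, not K-Radar's actual resolution, and the Doppler reduction is a common preprocessing choice rather than the dataset's prescribed pipeline:

```python
import numpy as np

# Assumed bin counts per axis, for illustration only.
N_DOPPLER, N_RANGE, N_AZIMUTH, N_ELEVATION = 64, 256, 107, 37

# One 4DRT frame: a power measurement for every
# (Doppler, range, azimuth, elevation) bin.
rt4d = np.random.rand(N_DOPPLER, N_RANGE, N_AZIMUTH, N_ELEVATION).astype(np.float32)

# Collapsing the Doppler axis (e.g. by mean or max) yields a 3D
# range-azimuth-elevation power volume that a detection head can consume.
rt3d = rt4d.mean(axis=0)
print(rt3d.shape)  # (256, 107, 37)
```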
12 PAPERS • NO BENCHMARKS YET
Detecting vehicles and representing their position and orientation in three-dimensional space is a key technology for autonomous driving. Recently, methods for 3D vehicle detection based solely on monocular RGB images have gained popularity. To facilitate this task, as well as to compare and drive state-of-the-art methods, several new datasets and benchmarks have been published. Ground truth annotations of vehicles are usually obtained using lidar point clouds, which often induces errors due to imperfect calibration or synchronization between the two sensors. To this end, we propose Cityscapes 3D, extending the original Cityscapes dataset with 3D bounding box annotations for all types of vehicles. In contrast to existing datasets, our 3D annotations were labeled using stereo RGB images only and capture all nine degrees of freedom. This leads to a pixel-accurate reprojection in the RGB image and a higher range of annotations compared to lidar-based approaches.
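As an illustration of what "all nine degrees of freedom" means for a vehicle box (3 DoF location, 3 DoF dimensions, 3 DoF orientation), here is a hypothetical sketch; the field names are illustrative and not Cityscapes 3D's actual annotation schema:

```python
from dataclasses import dataclass

@dataclass
class Box9DoF:
    # location of the box center (3 DoF)
    x: float
    y: float
    z: float
    # box dimensions (3 DoF)
    length: float
    width: float
    height: float
    # orientation as Euler angles (3 DoF)
    yaw: float
    pitch: float
    roll: float

box = Box9DoF(x=10.2, y=-1.5, z=0.9,
              length=4.5, width=1.8, height=1.5,
              yaw=0.1, pitch=0.0, roll=0.0)
```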
9 PAPERS • 3 BENCHMARKS
Roadside Perception 3D (Rope3D) is a dataset for autonomous driving and the monocular 3D object detection task, consisting of 50k images and over 1.5M 3D objects in various scenes. The images are captured under different settings, including cameras with varying mounting positions, specifications, and viewpoints, as well as different environmental conditions.
9 PAPERS • 1 BENCHMARK
Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image-based benchmark datasets have driven development in computer vision tasks such as object detection, tracking, and segmentation of agents in the environment. Most autonomous vehicles, however, carry a combination of cameras and range sensors such as lidar and radar. As machine learning based methods for detection and tracking become more prevalent, there is a need to train and evaluate such methods on datasets containing range sensor data along with images. In this work we present nuTonomy scenes (nuScenes), the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 degree field of view. nuScenes comprises 1000 scenes, each 20s long and fully annotated with 3D bounding boxes for 23 classes and 8 attributes. It has 7x as many annotations and 100x as many images as the pioneering KITTI dataset. We define novel 3D detection and tracking metrics.
9 PAPERS • 2 BENCHMARKS
4D-OR includes a total of 6734 scenes, recorded by six calibrated RGB-D Kinect sensors mounted to the ceiling of the OR at one frame per second, providing synchronized RGB and depth images. We provide fused point cloud sequences of entire scenes, automatically annotated human 6D poses, and 3D bounding boxes for OR objects. Furthermore, we provide SSG annotations for each step of the surgery together with the clinical roles of all the humans in the scenes, e.g., nurse, head surgeon, anesthesiologist.
8 PAPERS • 1 BENCHMARK
Falling Things (FAT) is a dataset for advancing the state-of-the-art in object detection and 3D pose estimation in the context of robotics. It consists of 60k generated photorealistic images with accurate 3D pose annotations for all objects.
6 PAPERS • NO BENCHMARKS YET
CrashD is a test benchmark for the robustness and generalization of 3D object detection models. It contains a wide range of out-of-distribution vehicles, including damaged, classic, and sports cars.
4 PAPERS • NO BENCHMARKS YET
🤖 Robo3D - The WOD-C Benchmark WOD-C is an evaluation benchmark heading toward robust and reliable 3D perception in autonomous driving. With it, we probe the robustness of 3D detectors and segmentors under out-of-distribution (OoD) scenarios, against natural corruptions that occur in real-world environments.
4 PAPERS • 1 BENCHMARK
Vehicle-to-Everything (V2X) networks have enabled collaborative perception in autonomous driving, a promising solution to the fundamental defects of stand-alone intelligence, such as blind zones and limited long-range perception. However, the lack of datasets has severely blocked the development of collaborative perception algorithms. In this work, we release DOLPHINS: Dataset for cOllaborative Perception enabled Harmonious and INterconnected Self-driving, a new simulated large-scale various-scenario multi-view multi-modality autonomous driving dataset, which provides a ground-breaking benchmark platform for interconnected autonomous driving. DOLPHINS outperforms current datasets in six dimensions: temporally-aligned images and point clouds from both vehicles and Road Side Units (RSUs), enabling both Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) based collaborative perception; and 6 typical scenarios with dynamic weather conditions, making it the most varied interconnected autonomous driving dataset.
3 PAPERS • NO BENCHMARKS YET
3 PAPERS • 1 BENCHMARK
Real-world dataset of ~400 images of cuboid-shaped parcels with full 2D and 3D annotations in the COCO format.
LiDAR-CS is a dataset for 3D object detection in real traffic. It contains 84,000 point cloud frames under 6 groups of different sensors with the same corresponding scenarios, captured from a hybrid realistic LiDAR simulator.
2 PAPERS • NO BENCHMARKS YET
2 PAPERS • 1 BENCHMARK
Synthetic dataset of over 13,000 images of damaged and intact parcels with full 2D and 3D annotations in the COCO format. For details see our paper and for visual samples our project page.
The Cooperative Driving dataset is a synthetic dataset generated using CARLA that contains lidar data from multiple vehicles navigating simultaneously through a diverse set of driving scenarios. This dataset was created to enable further research in multi-agent perception (cooperative perception) including cooperative 3D object detection, cooperative object tracking, multi-agent SLAM and point cloud registration. Towards that goal, all the frames have been labelled with ground-truth sensor pose and 3D object bounding boxes.
1 PAPER • NO BENCHMARKS YET
In this dataset, an upper-torso humanoid robot with a 7-DOF arm explored 100 different objects belonging to 20 different categories using 10 behaviors: Look, Crush, Grasp, Hold, Lift, Drop, Poke, Push, Shake, and Tap.
A large-scale comprehensive collection of dashcam videos collected by vehicles on DiDi's platform. D2-City contains more than 10,000 video clips that reflect the diversity and complexity of real-world traffic scenarios in China.
The MultiviewC dataset mainly contributes to multiview cattle action recognition, 3D object detection, and tracking. MultiviewC is a novel synthetic dataset built with UE4, based on a real cattle video dataset provided by CSIRO. Its setup, annotation, and file structure are adapted from MultiviewX.
The aiMotive dataset is a multimodal dataset for robust autonomous driving with long-range perception. The dataset consists of 176 scenes with synchronized and calibrated LiDAR, camera, and radar sensors covering a 360-degree field of view. The data was captured in highway, urban, and suburban areas, during daytime and at night and in rain, and is annotated with 3D bounding boxes with consistent identifiers across frames.
1 PAPER • 1 BENCHMARK
The dataset consists of images of human palms captured using a mobile phone. The images were taken in real-world scenarios, such as holding objects or performing simple gestures, and exhibit a wide variety of variations, such as illumination and distance. The dataset covers 3 main gestures: frontal open palm, back open palm, and fist with the wrist. It also includes many images of people wearing gloves.
0 PAPER • NO BENCHMARKS YET
H3D (Humans in 3D) is a dataset of annotated people.
To facilitate research on asynchrony for collaborative perception, we simulate the first collaborative perception dataset with different temporal asynchronies based on CARLA, named IRregular V2V (IRV2V). We set 100 ms as the ideal sampling interval and simulate various asynchronies in real-world scenarios from two main aspects: i) considering that agents are not synchronized with a unified global clock, we uniformly sample a time shift $\delta_s\sim \mathcal{U}(-50,50)\,\text{ms}$ for each agent in the same scene, and ii) considering the trigger noise of the sensors, we uniformly sample a time turbulence $\delta_d\sim \mathcal{U}(-10,10)\,\text{ms}$ for each sampling timestamp. The final asynchronous time interval between adjacent timestamps is the sum of the time shift and the time turbulence. In experiments, we also sample the frame intervals to achieve large-scale and diverse asynchrony. Each scene includes multiple collaborative agents, ranging from 2 to 5.
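A minimal sketch of this asynchrony model, directly following the stated distributions (a per-agent shift $\delta_s$ and a per-timestamp turbulence $\delta_d$ on top of the 100 ms ideal interval); the helper name is illustrative, not part of the IRV2V toolkit:

```python
import numpy as np

rng = np.random.default_rng(0)
IDEAL_INTERVAL_MS = 100.0  # ideal sampling interval stated above

def agent_timestamps(n_frames: int) -> np.ndarray:
    """Asynchronous sampling times (ms) for one collaborative agent."""
    shift = rng.uniform(-50.0, 50.0)                      # delta_s: one draw per agent
    turbulence = rng.uniform(-10.0, 10.0, size=n_frames)  # delta_d: one draw per timestamp
    ideal = np.arange(n_frames) * IDEAL_INTERVAL_MS
    return ideal + shift + turbulence

# Two agents in the same scene end up sampling at misaligned times.
print(agent_timestamps(5))
print(agent_timestamps(5))
```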
0 PAPER • 1 BENCHMARK