KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their needs. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 acquisitions (140 for training and 112 for testing), consisting of RGB images and Velodyne scans, from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images from the visual odometry challenge.
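The annotated class sets above are typically consumed as a simple label-to-id mapping when training a segmentation model on these extra ground-truth sets. A minimal sketch for the ten categories of Zhang et al.; the integer ids are illustrative and not taken from the original annotation files:

```python
# Hypothetical label mapping for the ten categories annotated by Zhang et al.
# The ids below are arbitrary example values, not part of the released labels.
ZHANG_KITTI_CLASSES = {
    "building": 0,
    "sky": 1,
    "road": 2,
    "vegetation": 3,
    "sidewalk": 4,
    "car": 5,
    "pedestrian": 6,
    "cyclist": 7,
    "sign/pole": 8,
    "fence": 9,
}

def class_id(name: str) -> int:
    """Look up the integer id used for an annotated category name."""
    return ZHANG_KITTI_CLASSES[name]
```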
3,219 PAPERS • 141 BENCHMARKS
xR-EgoPose is a synthetic dataset for egocentric 3D human pose estimation. It consists of around 380 thousand photo-realistic egocentric camera images captured in a variety of indoor and outdoor spaces.
10 PAPERS • NO BENCHMARKS YET
UnrealEgo is a dataset that provides in-the-wild stereo images with a large variety of motions for 3D human pose estimation. The data consists of stereo fisheye images and depth maps, each at a resolution of 1024×1024 pixels, captured at 25 frames per second, for a total of 450k stereo frames (900k images). Metadata is provided for each frame, including 3D joint positions, camera positions, and the 2D coordinates of the joint positions reprojected into the fisheye views.
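Since the per-frame metadata pairs 3D joint positions and camera poses with their 2D reprojections, one natural sanity check is to recompute those reprojections. The sketch below assumes an ideal equidistant fisheye model with a single focal length and principal point; the actual UnrealEgo camera model and parameter names are not specified here, so treat every name and value as a placeholder:

```python
import numpy as np

def project_equidistant_fisheye(joints_world, R, t, f, cx, cy):
    """Project 3D joints into a fisheye image (assumed equidistant model).

    joints_world: (N, 3) joint positions in world coordinates.
    R, t:         world-to-camera rotation (3x3) and translation (3,).
    f, cx, cy:    focal length and principal point in pixels.
    Returns (N, 2) pixel coordinates.
    """
    # Transform the joints into the camera frame.
    p_cam = joints_world @ R.T + t
    x, y, z = p_cam[:, 0], p_cam[:, 1], p_cam[:, 2]

    # Angle between each ray and the optical axis.
    r = np.sqrt(x**2 + y**2)
    theta = np.arctan2(r, z)

    # Equidistant model: radial image distance is proportional to theta.
    scale = np.zeros_like(r)
    mask = r > 1e-9
    scale[mask] = f * theta[mask] / r[mask]

    u = cx + scale * x
    v = cy + scale * y
    return np.stack([u, v], axis=-1)
```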
8 PAPERS • 1 BENCHMARK
Evaluating human-scene interaction requires precise annotations for camera pose and scene geometry. However, such information is not available in existing datasets for egocentric human pose estimation. To address this issue, we collected a new real-world dataset using a head-mounted fisheye camera combined with a calibration board. The ground-truth scene geometry is obtained with an SfM method from a multi-view capture system with 120 synced 4K-resolution cameras, and the ground-truth egocentric camera pose is obtained by localizing a calibration board rigidly attached to the egocentric camera. The dataset contains around 28K frames of two actors performing various human-scene interaction motions such as sitting, reading a newspaper, and using a computer. It is evenly split into training and testing sets, and we fine-tuned the method on the training split before evaluation. The dataset will be made publicly available; additional details are given in the supplement.
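The egocentric camera pose described above follows from chaining two rigid transforms: the board pose localized in the multi-view system's world frame and the fixed board-to-camera offset given by the rigid mounting. A minimal sketch of that composition; the function and variable names are illustrative, not from any released toolkit:

```python
import numpy as np

def to_homogeneous(R, t):
    """Build a 4x4 rigid transform from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def egocentric_camera_pose(T_world_board, T_board_cam):
    """Camera-to-world pose of the head-mounted fisheye camera.

    T_world_board: 4x4 pose of the calibration board localized by the
                   multi-view capture system (board frame -> world frame).
    T_board_cam:   4x4 fixed transform from the egocentric camera to the
                   board, known from the rigid mounting (camera -> board).
    """
    # Chain the transforms: camera -> board -> world.
    return T_world_board @ T_board_cam
```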
7 PAPERS • 1 BENCHMARK
Egocentric motion capture dataset
6 PAPERS • 1 BENCHMARK
EgoPW-Scene extends the EgoPW training dataset with scene annotations, since we want to generalize to data captured with a real head-mounted camera. For this, we first reconstruct the scene geometry from the egocentric image sequences of the EgoPW training dataset with a Structure-from-Motion (SfM) algorithm. This step provides a dense reconstruction of the background scene. The global scale of the reconstruction is recovered from known objects present in the sequences, such as laptops and chairs. We further render depth maps of the scene in the egocentric perspective based on the reconstructed geometry. The EgoPW-Scene dataset contains 92K frames in total, distributed over 30 sequences performed by 5 actors.
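Two of the steps above lend themselves to a short sketch: recovering the metric scale from an object of known real-world size, and rendering egocentric depth from the reconstructed geometry. The version below is a simplified illustration only: it splats reconstructed 3D points instead of rasterizing the dense mesh, and the projection function and all names are assumptions rather than the authors' implementation:

```python
import numpy as np

def recover_scale(known_size_m, reconstructed_size):
    """Metric scale factor from a reference object (e.g. a laptop or chair)."""
    return known_size_m / reconstructed_size

def render_point_depth_map(points_world, R, t, project_fn, height, width):
    """Render a sparse egocentric depth map by splatting scaled scene points.

    points_world: (N, 3) metrically scaled reconstructed points.
    R, t:         world-to-camera rotation and translation of the egocentric view.
    project_fn:   maps camera-frame points (N, 3) to pixel coordinates (N, 2);
                  for an egocentric setup this would be a fisheye projection.
    """
    p_cam = points_world @ R.T + t
    depth = p_cam[:, 2]
    uv = np.round(project_fn(p_cam)).astype(int)

    depth_map = np.full((height, width), np.inf)
    for (u, v), d in zip(uv, depth):
        if d > 0 and 0 <= u < width and 0 <= v < height:
            # Keep the closest point per pixel (a simple z-buffer).
            depth_map[v, u] = min(depth_map[v, u], d)
    depth_map[np.isinf(depth_map)] = 0.0  # pixels with no points marked empty
    return depth_map
```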
1 PAPER • NO BENCHMARKS YET