The YCB-Video dataset is a large-scale video dataset for 6D object pose estimation. It provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames.
146 PAPERS • 6 BENCHMARKS
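Results on YCB-Video are usually summarized as the area under the ADD (and, for symmetric objects, ADD-S) accuracy-threshold curve up to 10 cm. A minimal sketch of plain ADD, assuming poses are given as a 3x3 rotation matrix and a 3-vector translation in metres:

    import numpy as np

    def add_error(R_gt, t_gt, R_pred, t_pred, model_points):
        """Average distance (ADD): mean Euclidean distance between the
        object's model points transformed by the ground-truth pose and
        by the predicted pose. model_points is an (N, 3) array."""
        pts_gt = model_points @ R_gt.T + t_gt
        pts_pred = model_points @ R_pred.T + t_pred
        return np.linalg.norm(pts_gt - pts_pred, axis=1).mean()

Under the threshold-based variant of this metric, a pose is typically counted as correct when the ADD falls below 10% of the object's diameter.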
REAL275 is a benchmark for category-level pose estimation. It contains 4,300 training frames, 950 validation frames, and 2,750 test frames across 18 different real scenes.
53 PAPERS • 1 BENCHMARK
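Category-level results on REAL275 are commonly reported as rotation/translation accuracy at thresholds such as 5°5cm, alongside 3D IoU. A minimal sketch of the per-pose errors, assuming rotation matrices and translations in metres:

    import numpy as np

    def rotation_error_deg(R_gt, R_pred):
        """Geodesic angle between two rotation matrices, in degrees."""
        cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    def translation_error_cm(t_gt, t_pred):
        """Euclidean translation error, converted from metres to cm."""
        return 100.0 * np.linalg.norm(np.asarray(t_gt) - np.asarray(t_pred))

A prediction passes the 5°5cm criterion when both errors fall below their thresholds; symmetric categories are usually evaluated with rotation about the symmetry axis factored out.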
The LM (Linemod) dataset was introduced by Stefan Hinterstoisser and colleagues in their work on model-based training, detection, and pose estimation of texture-less 3D objects in heavily cluttered scenes.
33 PAPERS • 5 BENCHMARKS
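Two LM objects (eggbox and glue) are rotationally ambiguous, so evaluations typically switch to the symmetric variant ADD-S, which matches each transformed model point to its nearest neighbour instead of to the point with the same index. A sketch, assuming (N, 3) model points:

    import numpy as np
    from scipy.spatial import cKDTree

    def add_s_error(R_gt, t_gt, R_pred, t_pred, model_points):
        """ADD-S: mean distance from each predicted model point to the
        closest ground-truth model point (symmetry-tolerant)."""
        pts_gt = model_points @ R_gt.T + t_gt
        pts_pred = model_points @ R_pred.T + t_pred
        dists, _ = cKDTree(pts_gt).query(pts_pred, k=1)
        return dists.mean()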
HomebrewedDB is a dataset for 6D pose estimation mainly targeting training from 3D models (both textured and textureless), scalability, occlusions, and changes in light conditions and object appearance. The dataset features 33 objects (17 toy, 8 household and 8 industry-relevant objects) over 13 scenes of various difficulty. It also provides a set of benchmarks for testing various desired detector properties, particularly focusing on scalability with respect to the number of objects and robustness to changing light conditions, occlusions and clutter.
24 PAPERS • NO BENCHMARKS YET
ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with an industry-grade 3D CAD model with absolute model size and semantically labelled keypoints. The dataset is more than 20 times larger than PASCAL3D+ and KITTI, the previous state-of-the-art datasets of this kind.
17 PAPERS • 14 BENCHMARKS
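Because each ApolloCar3D instance comes with semantically labelled keypoints and an aligned CAD model, a natural baseline recovers the car pose from 2D-3D keypoint correspondences via PnP. A hedged sketch with OpenCV; the keypoint arrays, image size, and intrinsics below are placeholders, not the dataset's actual values:

    import cv2
    import numpy as np

    # Placeholder correspondences: 3D keypoints on the car's CAD model
    # (metres, model frame) and their detected 2D pixel locations.
    object_pts = np.random.rand(12, 3)
    image_pts = np.random.rand(12, 2) * [3384, 2710]   # placeholder image size
    K = np.array([[2304.0, 0.0, 1686.0],               # placeholder intrinsics
                  [0.0, 2305.0, 1354.0],
                  [0.0, 0.0, 1.0]])

    ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None)
    R, _ = cv2.Rodrigues(rvec)   # 3x3 rotation of the car in the camera frame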
A new dataset with significant occlusions related to object manipulation.
4 PAPERS • NO BENCHMARKS YET
We present a large-scale dataset for 3D urban scene understanding. Compared to existing datasets, ours consists of 75 outdoor urban scenes with diverse backgrounds, encompassing over 15,000 images. These scenes offer 360° hemispherical views, capturing diverse foreground objects illuminated under various lighting conditions. Additionally, our dataset is not limited to forward-driving views, addressing the limitations of previous datasets such as limited overlap and coverage between camera views. The closest pre-existing dataset for generalizable evaluation is DTU [2] (80 scenes), which comprises mostly indoor objects and does not provide multiple foreground objects or background scenes.
3 PAPERS • 1 BENCHMARK
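Generalizable view-synthesis methods evaluated on scene collections like this one typically report PSNR against held-out reference views. A minimal sketch:

    import numpy as np

    def psnr(rendered, reference, max_val=1.0):
        """Peak signal-to-noise ratio between a rendered view and the
        held-out reference image (float arrays scaled to [0, max_val])."""
        mse = np.mean((np.asarray(rendered) - np.asarray(reference)) ** 2)
        return 10.0 * np.log10(max_val ** 2 / mse)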
CORSMAL is a dataset for estimating the position and orientation in 3D (the 6D pose) of an object from a single view. The dataset consists of 138,240 images of rendered hands and forearms holding 48 synthetic objects, split into 3 grasp categories, over 30 real backgrounds.
2 PAPERS • NO BENCHMARKS YET
Accurately tracking the six-degree-of-freedom pose of an object in real scenes is an important task in computer vision and augmented reality with numerous applications. Although a variety of algorithms for this task have been proposed, it remains difficult to evaluate existing methods in the literature, as oftentimes different sequences are used and no large benchmark datasets close to real-world scenarios are available. In this paper, we present a large object pose tracking benchmark dataset consisting of RGB-D video sequences of 2D and 3D targets with ground-truth information. The videos are recorded under various lighting conditions, different motion patterns and speeds with the help of a programmable robotic arm. We present extensive quantitative evaluation results of the state-of-the-art methods on this benchmark dataset and discuss the potential research directions in this field.
2 PAPERS • 1 BENCHMARK
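Tracking benchmarks of this kind are often scored by the fraction of frames on which the estimated pose stays within fixed rotation and translation thresholds. A sketch of that criterion, assuming per-frame (R, t) pairs; the 5°/5 cm thresholds here are illustrative, not this benchmark's official protocol:

    import numpy as np

    def tracking_success_rate(poses_gt, poses_pred,
                              rot_thresh_deg=5.0, trans_thresh_m=0.05):
        """Fraction of frames whose pose error is within both thresholds.
        Each pose is a (3x3 rotation, 3-vector translation) pair."""
        hits = 0
        for (R_gt, t_gt), (R_p, t_p) in zip(poses_gt, poses_pred):
            cos = (np.trace(R_p @ R_gt.T) - 1.0) / 2.0
            rot_err = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
            trans_err = np.linalg.norm(np.asarray(t_gt) - np.asarray(t_p))
            hits += (rot_err <= rot_thresh_deg) and (trans_err <= trans_thresh_m)
        return hits / len(poses_gt)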
The dataset consists of both real captures from Photoneo PhoXi structured-light scanners, annotated by hand, and synthetic samples produced by a custom generator. In comparison with existing datasets for 6D pose estimation, some notable differences include:
1 PAPER • 1 BENCHMARK
Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure-from-motion techniques assume that static scene parts are observed alongside the deforming ones in order to establish an anchoring reference. However, this assumption does not hold in certain relevant applications such as endoscopies. To tackle this issue with a common benchmark, we introduce the Drunkard's Dataset, a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. This dataset is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformations over time. Simulations in realistic 3D buildings let us obtain a vast amount of data and ground-truth labels, including camera poses, RGB images and depth, optical flow and normal maps at high resolution and quality.
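Camera trajectories estimated on sequences like those in the Drunkard's Dataset are commonly compared against ground truth with the absolute trajectory error (ATE) after a rigid alignment. A minimal Kabsch-based sketch over (N, 3) position arrays:

    import numpy as np

    def ate_rmse(traj_gt, traj_est):
        """Absolute trajectory error: RMSE of position error after
        rigidly aligning the estimated trajectory to the ground truth.
        Both arguments are (N, 3) arrays of camera positions."""
        traj_gt, traj_est = np.asarray(traj_gt), np.asarray(traj_est)
        Xc = traj_est - traj_est.mean(axis=0)
        Yc = traj_gt - traj_gt.mean(axis=0)
        U, _, Vt = np.linalg.svd(Xc.T @ Yc)
        d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        residual = Yc - Xc @ R.T
        return np.sqrt((residual ** 2).sum(axis=1).mean())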
This data set contains over 600 GB of multimodal data from a Mars analog mission, including accurate 6DoF outdoor ground truth, indoor-outdoor transitions with continuous cross-domain ground truth, and indoor data with OptiTrack measurements as ground truth. With 26 flights and a combined distance of 2.5 km, the data set offers a variety of distinct challenges for testing and proving your algorithms. The UAV carries 18 sensors, including a high-resolution navigation camera and a stereo camera with an overlapping field of view, two RTK GNSS sensors with centimeter accuracy, and three IMUs placed at strategic locations: hardware-dampened at the center, off-center with a lever arm, and a 1 kHz IMU rigidly attached to the UAV (in case you want to work with unfiltered data). The sensors are fully pre-calibrated, and the data set is ready to use. If you prefer to run your own calibration algorithms, the raw calibration data is also available for download.
1 PAPER • NO BENCHMARKS YET
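One practical detail highlighted above is the IMU mounted off-center with a lever arm: a sensor at offset r from the body reference point measures additional centripetal and Euler acceleration terms that must be compensated. A sketch of those terms (an illustration of the physics, not this dataset's own processing code):

    import numpy as np

    def lever_arm_acceleration(omega, alpha, r):
        """Extra acceleration sensed by an IMU at body-frame offset r:
        centripetal term w x (w x r) plus Euler term a x r.
        omega: angular velocity (rad/s), alpha: angular acceleration
        (rad/s^2), r: lever arm (m); all 3-vectors in the body frame."""
        omega, alpha, r = map(np.asarray, (omega, alpha, r))
        return np.cross(omega, np.cross(omega, r)) + np.cross(alpha, r)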
KITTI-6DoF is a dataset that contains annotations for the 6DoF estimation task for 5 object categories on 7,481 frames.
The SMOT (Single sequence-Multi Objects Training) dataset is collected to represent a practical scenario for gathering training images of new objects in the real world: a mobile robot with an RGB-D camera collects a sequence of frames while driving around a table in order to learn multiple objects, and then tries to recognize those objects in different locations.
Data and experiments for motion-based extrinsic calibration using trajectory_calibration. The data was generated for evaluating hand-eye calibration algorithms in extrinsic sensor-to-sensor calibration.
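Hand-eye calibration solves AX = XB for the fixed sensor-to-sensor transform X from paired motion sequences. The sketch below synthesizes noise-free poses and recovers X with OpenCV's solver; it illustrates the problem setup rather than the trajectory_calibration pipeline itself:

    import cv2
    import numpy as np

    rng = np.random.default_rng(0)

    def rand_pose():
        """Random rigid transform as a 4x4 homogeneous matrix."""
        R, _ = cv2.Rodrigues(rng.normal(size=(3, 1)))
        T = np.eye(4)
        T[:3, :3], T[:3, 3] = R, rng.normal(size=3)
        return T

    X = rand_pose()       # true camera-to-gripper transform to recover
    target = rand_pose()  # fixed calibration target pose in the base frame

    R_g2b, t_g2b, R_t2c, t_t2c = [], [], [], []
    for _ in range(10):
        A = rand_pose()                    # gripper pose in the base frame
        B = np.linalg.inv(A @ X) @ target  # implied target pose in the camera frame
        R_g2b.append(A[:3, :3])
        t_g2b.append(A[:3, 3].reshape(3, 1))
        R_t2c.append(B[:3, :3])
        t_t2c.append(B[:3, 3].reshape(3, 1))

    R_est, t_est = cv2.calibrateHandEye(R_g2b, t_g2b, R_t2c, t_t2c,
                                        method=cv2.CALIB_HAND_EYE_TSAI)
    print(np.linalg.norm(R_est - X[:3, :3]))  # ~0 on this noise-free data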
Data used for the paper "SparsePoser: Real-time Full-body Motion Reconstruction from Sparse Data".
Estimating 6D object poses is a major challenge in 3D computer vision. Building on successful instance-level approaches, research is shifting towards category-level pose estimation for practical applications. Current category-level datasets, however, fall short in annotation quality and pose variety. Addressing this, we introduce HouseCat6D, a new category-level 6D pose dataset. It 1) features multimodality with polarimetric RGB and depth (RGBD+P), 2) encompasses 194 diverse objects across 10 household categories, including two photometrically challenging ones, and 3) provides high-quality pose annotations with an error range of only 1.35 mm to 1.74 mm. The dataset also includes 4) 41 large-scale scenes with comprehensive viewpoint and occlusion coverage, 5) a checkerboard-free environment, and 6) dense 6D parallel-jaw robotic grasp annotations. Additionally, we present benchmark results for leading category-level pose estimation networks.
0 PAPERS • NO BENCHMARKS YET
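Category-level benchmarks such as HouseCat6D typically report mean average precision at 3D IoU thresholds over object bounding boxes. An axis-aligned simplification that illustrates the overlap computation (the real protocol uses oriented boxes, which require a convex-polyhedron intersection):

    import numpy as np

    def iou_3d_axis_aligned(box_a, box_b):
        """IoU of two axis-aligned 3D boxes, each given as a
        (min_xyz, max_xyz) pair of 3-vectors."""
        lo = np.maximum(box_a[0], box_b[0])
        hi = np.minimum(box_a[1], box_b[1])
        inter = np.prod(np.clip(hi - lo, 0.0, None))
        vol_a = np.prod(np.asarray(box_a[1]) - np.asarray(box_a[0]))
        vol_b = np.prod(np.asarray(box_b[1]) - np.asarray(box_b[0]))
        return inter / (vol_a + vol_b - inter)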