ShapeNet is a large-scale repository of 3D CAD models developed by researchers from Stanford University, Princeton University and the Toyota Technological Institute at Chicago. The repository contains over 3 million models, 220,000 of which are classified into 3,135 classes arranged using WordNet hypernym-hyponym relationships. The ShapeNet Parts subset contains 31,693 meshes categorised into 16 common object classes (e.g. table, chair, plane). Each shape's ground truth contains 2-5 parts, drawn from a total of 50 part classes (see the loading sketch below).
1,689 PAPERS • 13 BENCHMARKS
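As an illustration of the part-annotation layout described above, here is a minimal loading sketch. It assumes the common distribution format in which each shape is a whitespace-separated points file plus a per-point label file; the file names and directory layout are illustrative, not actual dataset paths.

```python
import numpy as np

def load_shapenet_part(points_path, labels_path):
    """Load one ShapeNet Parts shape: an (N, 3) point cloud and an (N,)
    array of per-point part labels (drawn from the 50 part classes)."""
    points = np.loadtxt(points_path, dtype=np.float32)  # one "x y z" row per point
    labels = np.loadtxt(labels_path, dtype=np.int64)    # one part-class id per point
    assert len(points) == len(labels), "points and labels must align"
    return points, labels

# Illustrative paths: shapes are typically grouped by category directory.
pts, seg = load_shapenet_part("airplane/points/shape_0001.pts",
                              "airplane/points_label/shape_0001.seg")
```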
ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled voxels rather than points or objects. ScanNet v2, the latest version, contains 1,513 annotated scans with approximately 90% surface coverage. For the semantic segmentation task, the dataset is annotated with 20 classes of 3D voxelized objects.
1,240 PAPERS • 19 BENCHMARKS
DTU MVS 2014 is a multi-view stereo dataset that is an order of magnitude larger in the number of scenes than previous MVS datasets, with a significant increase in diversity. Specifically, it contains 80 scenes of large variability. Each scene consists of 49 or 64 accurate camera positions and reference structured-light scans, all acquired by a 6-axis industrial robot.
219 PAPERS • 2 BENCHMARKS
The 300-W is a face dataset that consists of 300 Indoor and 300 Outdoor in-the-wild images. It covers a large variation of identity, expression, illumination conditions, pose, occlusion and face size. The images were downloaded from google.com by making queries such as “party”, “conference”, “protests”, “football” and “celebrities”. Compared to other in-the-wild datasets, the 300-W database contains a larger percentage of partially-occluded images and covers more expressions than the common “neutral” or “smile”, such as “surprise” or “scream”. Images were annotated with the 68-point mark-up using a semi-automatic methodology. The images of the database were carefully selected so that they represent a characteristic sample of challenging but natural face instances under totally unconstrained conditions. Thus, methods that achieve accurate performance on the 300-W database can demonstrate the same accuracy in most realistic cases. Many images of the database contain more than one annotated face (see the .pts parser sketch below).
198 PAPERS • 9 BENCHMARKS
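The 68-point annotations mentioned above are commonly distributed as small text files (one per image) in the .pts convention: a short header followed by 68 "x y" rows between braces. A minimal parser sketch, assuming that layout; the file name is illustrative.

```python
import numpy as np

def read_pts(path):
    """Parse a 300-W style .pts landmark file into an (n_points, 2) array.
    Assumes the 'version / n_points / { ... }' text layout."""
    with open(path) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    start, end = lines.index("{") + 1, lines.index("}")
    coords = [[float(v) for v in ln.split()] for ln in lines[start:end]]
    return np.asarray(coords, dtype=np.float32)

landmarks = read_pts("indoor_001.pts")  # illustrative file name
assert landmarks.shape == (68, 2)
```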
ShapeNetCore is a subset of the full ShapeNet dataset with single clean 3D models and manually verified category and alignment annotations. It covers 55 common object categories with about 51,300 unique 3D models. The 12 object categories of PASCAL 3D+, a popular computer vision 3D benchmark dataset, are all covered by ShapeNetCore.
156 PAPERS • 1 BENCHMARK
The MegaDepth dataset is a dataset for single-view depth prediction that includes 196 different locations reconstructed via COLMAP SfM/MVS.
115 PAPERS • NO BENCHMARKS YET
The ABC Dataset is a collection of one million Computer-Aided Design (CAD) models for research of geometric deep learning methods and applications. Each model is a collection of explicitly parametrized curves and surfaces, providing ground truth for differential quantities, patch segmentation, geometric feature detection, and shape reconstruction. Sampling the parametric descriptions of surfaces and curves allows generating data in different formats and resolutions, enabling fair comparisons for a wide range of geometric learning algorithms.
83 PAPERS • NO BENCHMARKS YET
ETH3D is a multi-view stereo / 3D reconstruction benchmark that covers a variety of indoor and outdoor scenes. Ground-truth geometry was obtained using a high-precision laser scanner. Images were captured with a DSLR camera as well as a synchronized multi-camera rig with varying fields of view.
79 PAPERS • 1 BENCHMARK
BlendedMVS is a large-scale dataset providing sufficient training ground truth for learning-based MVS. The dataset was created by applying a 3D reconstruction pipeline to recover high-quality textured meshes from images of well-selected scenes. These mesh models were then rendered into color images and depth maps.
74 PAPERS • NO BENCHMARKS YET
Scan2CAD is an alignment dataset based on 1,506 ScanNet scans with 97,607 annotated keypoint pairs between 14,225 (3,049 unique) CAD models from ShapeNet and their counterpart objects in the scans. The top 3 annotated model classes are chairs, tables and cabinets, which arises from the nature of indoor scenes in ScanNet. The number of objects aligned per scene ranges from 1 to 40, with an average of 9.3 (see the 9-DoF alignment sketch below).
65 PAPERS • 1 BENCHMARK
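Each Scan2CAD alignment is a 9-DoF transform (translation, rotation, per-axis scale) mapping a CAD model into the scan. Below is a sketch of composing it into a 4x4 matrix; the (w, x, y, z) quaternion ordering and the T·R·S composition are common conventions, not guaranteed to match the released annotation files exactly.

```python
import numpy as np

def trs_to_matrix(translation, quaternion, scale):
    """Compose a 4x4 CAD-to-scan alignment from a 9-DoF annotation:
    translation (3,), unit quaternion (w, x, y, z), per-axis scale (3,)."""
    w, x, y, z = quaternion
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    M = np.eye(4)
    M[:3, :3] = R * np.asarray(scale)  # column-wise scale: equals R @ diag(scale)
    M[:3, 3] = translation
    return M

# Identity rotation and unit scale: M reduces to a pure translation.
M = trs_to_matrix([1.0, 0.5, 2.0], [1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```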
For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding. It contains 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.
59 PAPERS • 1 BENCHMARK
ABO is a large-scale dataset designed for material prediction and multi-view retrieval experiments. It contains Blender renderings from 30 viewpoints for each of 7,953 3D objects, as well as camera intrinsics and extrinsics for each rendering.
46 PAPERS • NO BENCHMARKS YET
This benchmark targets image-based 3D reconstruction. The sequences were acquired outside the lab, in realistic conditions, and ground-truth data was captured using an industrial laser scanner. The benchmark includes both outdoor scenes and indoor environments. High-resolution video sequences are provided as input, supporting the development of novel pipelines that take advantage of video input to increase reconstruction fidelity. The performance of many image-based 3D reconstruction pipelines is reported on the benchmark, pointing to challenges and opportunities for future work.
39 PAPERS • 2 BENCHMARKS
This dataset enables detailed human body model reconstruction in clothing from a single monocular RGB video, without requiring a pre-scanned template or manually clicked points.
33 PAPERS • NO BENCHMARKS YET
Dynamic FAUST extends the FAUST dataset to dynamic 4D data. It consists of high-resolution 4D scans of human subjects in motion, captured at 60 fps.
28 PAPERS • 1 BENCHMARK
ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with an industry-grade 3D CAD model with absolute model size and semantically labelled keypoints. The dataset is more than 20 times larger than PASCAL3D+ and KITTI, the previous state of the art.
17 PAPERS • 14 BENCHMARKS
Common Objects in 3D (CO3D) is a large-scale dataset with real multi-view images of object categories annotated with camera poses and ground-truth 3D point clouds. The dataset contains a total of 1.5 million frames from nearly 19,000 videos capturing objects from 50 MS-COCO categories and, as such, is significantly larger than alternatives in terms of both the number of categories and the number of objects (see the pinhole-projection sketch below).
17 PAPERS • 1 BENCHMARK
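A generic pinhole-projection sketch of the kind used to relate camera poses to ground-truth point clouds such as CO3D's; the world-to-camera convention and symbol names here are generic assumptions, not CO3D's actual field names.

```python
import numpy as np

def project_points(X_world, R, t, K):
    """Project (N, 3) world points into pixels with a pinhole camera:
    x ~ K (R X + t), assuming a world-to-camera rotation R (3, 3),
    translation t (3,) and intrinsic matrix K (3, 3)."""
    X_cam = X_world @ R.T + t      # world -> camera coordinates
    uv = X_cam @ K.T               # apply the intrinsic matrix
    return uv[:, :2] / uv[:, 2:3]  # perspective divide -> (N, 2) pixel coords
```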
SceneNet is a dataset of labelled synthetic indoor scenes, spanning several scene categories.
17 PAPERS • NO BENCHMARKS YET
This benchmark dataset includes a manually annotated point cloud covering over 260 million laser scanning points, grouped into approximately 100,000 assets, from the 2015 Dublin LiDAR point cloud [12]. Objects are labelled into 13 classes using hierarchical levels of detail, from coarse (e.g., building, vegetation and ground) to refined (e.g., window, door and tree) elements (see the LAS-loading sketch below).
13 PAPERS • NO BENCHMARKS YET
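Aerial LiDAR tiles like these are typically shipped as LAS/LAZ files; below is a minimal reading sketch using laspy. The file name is illustrative, and whether the 13-class labels live in the standard classification field or in separate annotation files is an assumption.

```python
import laspy
import numpy as np

# Illustrative file name; the class-id-to-name mapping is dataset-specific.
las = laspy.read("dublin_tile.las")
xyz = np.vstack([las.x, las.y, las.z]).T  # (N, 3) point coordinates
labels = np.asarray(las.classification)   # per-point class ids
print(xyz.shape, np.unique(labels))
```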
ARCTIC is a dataset of free-form interactions of hands and articulated objects. ARCTIC has 1.2M images paired with accurate 3D meshes for both hands and for objects that move and deform over time. The dataset also provides hand-object contact information.
9 PAPERS • NO BENCHMARKS YET
The ETH SfM (structure-from-motion) dataset is a dataset for 3D reconstruction. The benchmark investigates how different methods perform at building a 3D model from a set of available 2D images.
8 PAPERS • NO BENCHMARKS YET
The Oxford-Affine dataset is a small dataset containing 8 scenes, with a sequence of 6 images per scene. The images in a sequence are related by homographies.
7 PAPERS • NO BENCHMARKS YET
A large-scale synthetic dataset with 2.5 million photo-realistic images of 80 subjects performing 70 activities and wearing diverse outfits.
6 PAPERS • NO BENCHMARKS YET
Generating high-quality 3D ground-truth shapes for reconstruction evaluation is extremely challenging, because even 3D scanners can only produce pseudo-ground-truth shapes with artefacts. MobileBrick uses a novel data capturing and 3D annotation pipeline to obtain precise 3D ground-truth shapes without relying on expensive 3D scanners. The key to creating precise 3D ground-truth shapes is using LEGO models, which are built from LEGO bricks with known geometry. The MobileBrick dataset provides a unique opportunity for future research on high-quality 3D reconstruction thanks to two distinctive features: 1) a large number of RGBD sequences with precise 3D ground-truth annotations; and 2) RGBD images captured using mobile devices, so algorithms can be tested in a realistic setup for mobile AR applications.
MonoPerfCap is a benchmark dataset for human 3D performance capture from monocular video input consisting of around 40k frames, which covers a variety of different scenarios.
A Large Dataset of Object Scans is a dataset of more than ten thousand 3D scans of real objects. To create the dataset, the authors recruited 70 operators, equipped them with consumer-grade mobile 3D scanning setups, and paid them to scan objects in their environments. The operators scanned objects of their choosing, outside the laboratory and without direct supervision by computer vision professionals. The result is a large and diverse collection of object scans: from shoes, mugs, and toys to grand pianos, construction vehicles, and large outdoor sculptures. The authors worked with an attorney to ensure that data acquisition did not violate privacy constraints. The acquired data was placed in the public domain and is available freely.
5 PAPERS • NO BENCHMARKS YET
HOD is a dataset for 3D object reconstruction which contains 35 objects divided into two subsets, Sculptures and Daily Objects. Sculptures contains five human sculptures with complex geometries and pure white textures; Daily Objects consists of 30 daily objects with various shapes and appearances. All of the Sculptures and nine of the Daily Objects are paired with high-fidelity scanned meshes as ground-truth geometry for evaluation.
4 PAPERS • NO BENCHMARKS YET
Houses3K is a dataset of 3,000 textured 3D house models, divided into twelve batches of 50 unique house geometries each. For each batch, five different textures were applied, forming the sets (A, B, C, D, E); in total, 12 batches × 50 geometries × 5 textures = 3,000 textured models.
CoP3D is a collection of crowd-sourced videos showing around 4,200 distinct pets. It is a large-scale dataset for benchmarking non-rigid 3D reconstruction "in the wild".
3 PAPERS • NO BENCHMARKS YET
Dynamic Replica is a synthetic dataset of stereo videos featuring humans and animals in virtual environments. It is a benchmark for dynamic disparity/depth estimation and 3D reconstruction consisting of 145,200 stereo frames (524 videos).
Real-world dataset of ~400 images of cuboid-shaped parcels with full 2D and 3D annotations in the COCO format.
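Since the annotations follow the COCO format, they can be browsed with the standard pycocotools API. The annotation file name below is illustrative, and the 3D fields, being dataset-specific extensions, would sit alongside the standard 2D ones.

```python
from pycocotools.coco import COCO

coco = COCO("parcel2d_annotations.json")  # illustrative file name
img_ids = coco.getImgIds()
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[0]))
for ann in anns:
    print(ann["category_id"], ann["bbox"])  # COCO bbox: [x, y, width, height]
```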
ShapenetRenderer is an extension of the ShapeNet Core dataset with more variation in camera angles. For each mesh model, the dataset provides 36 views with smaller variation and 36 views with larger variation. The newly rendered images have a resolution of 224×224, in contrast to the original 137×137 renderings. Additionally, each RGB image is paired with a depth image, a normal map and an albedo image.
Tragic Talkers is an audio-visual dataset consisting of excerpts from the "Romeo and Juliet" drama captured with microphone arrays and multiple co-located cameras for light-field video. Tragic Talkers provides ideal content for object-based media (OBM) production. It is designed to cover various conventional talking scenarios, such as monologues, two-people conversations, and interactions with considerable movement and occlusion, yielding 30 sequences captured from a total of 22 different points of view and two 16-element microphone arrays.
ENRICH is a synthetic, multi-purpose dataset for testing photogrammetric and computer vision algorithms. Compared to existing datasets, it offers higher-resolution images rendered under different lighting conditions, camera orientations, scales, and fields of view. Specifically, ENRICH is composed of three sub-datasets: ENRICH-Aerial, ENRICH-Square, and ENRICH-Statue, each exhibiting different characteristics. The dataset is useful for several photogrammetry and computer vision tasks, such as the evaluation of hand-crafted and deep-learning-based local features, the effect of ground control point (GCP) configurations on 3D accuracy, and monocular depth estimation.
2 PAPERS • NO BENCHMARKS YET
A dataset of high-resolution, textured scans of articulated left feet, useful for 3D shape representation learning.
General-purpose Visual Understanding Evaluation (G-VUE) is a comprehensive benchmark covering the full spectrum of visual cognitive abilities with four functional domains -- Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation.
Dataset page: https://github.com/mosamdabhi/MBW-Data
Middlebury MVS is the earliest MVS dataset for multi-view stereo network evaluation. It contains two indoor objects with low-resolution (640 × 480) images and calibrated cameras.
Pano3D is a benchmark for depth estimation from spherical panoramas whose goal is to drive progress on this task in a consistent and holistic manner. It provides a standard Matterport3D train/test split, as well as a secondary GibsonV2 partitioning for training and testing. The latter is used to assess zero-shot cross-dataset transfer and is decomposed into 3 different splits, each focusing on a specific generalization axis.
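For depth maps predicted on equirectangular panoramas like Pano3D's, turning per-pixel depth into 3D points requires per-pixel ray directions. Below is a sketch under one common spherical convention; the axis orientation and angle origin are assumptions, not necessarily the ones Pano3D uses.

```python
import numpy as np

def equirect_rays(width, height):
    """Per-pixel unit ray directions for an equirectangular panorama."""
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    lon = (u + 0.5) / width * 2 * np.pi - np.pi   # longitude in [-pi, pi)
    lat = np.pi / 2 - (v + 0.5) / height * np.pi  # latitude in [-pi/2, pi/2]
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)           # (H, W, 3) unit vectors

# points = equirect_rays(width, height) * depth[..., None]  # if depth is radial
```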
Synthetic dataset of over 13,000 images of damaged and intact parcels with full 2D and 3D annotations in the COCO format.
AMT Objects is a large dataset of object-centric videos suitable for training and benchmarking models that generate 3D models of objects from a small number of photos. The dataset consists of multiple views of a large collection of object instances.
1 PAPER • NO BENCHMARKS YET
This dataset comprises 4 different subsets - Flat, House, Priory and Lab - each containing a number of different sequences that can be relocalised against each other.
The DRACO20K dataset is used for evaluating object canonicalization with methods that estimate a canonical frame from a monocular input image.
Danish Airs and Grounds (DAG) is a large collection of street-level and aerial images targeting matching and localization between the two views. Its main challenge lies in the extreme viewing-angle difference between query and reference images, with consequent changes in illumination and perspective. The dataset is larger and more diverse than currently available public data, covering more than 50 km of road in urban, suburban and rural areas. All images are associated with accurate 6-DoF metadata that allows the benchmarking of visual localization methods.
Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure-from-motion techniques assume that static scene parts are observed alongside deforming parts in order to establish an anchoring reference, but this assumption does not hold in certain relevant applications such as endoscopies. To tackle this issue with a common benchmark, the Drunkard's Dataset is a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. It is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformations over time. Simulations in realistic 3D buildings yield a vast amount of data and ground-truth labels, including camera poses, RGB images and depth, optical flow and normal maps at high resolution and quality.
1 PAPER • 1 BENCHMARK
The dfd_indoor dataset contains 110 images for training and 29 images for testing. The dfd_outdoor dataset contains 34 images for testing; no ground truth is provided for this dataset, as the depth sensor only works on indoor scenes.
Replay is a collection of multi-view, multi-modal videos of humans interacting socially. Each scene is filmed in high production quality, from different viewpoints with several static cameras, as well as wearable action cameras, and recorded with a large array of microphones at different positions in the room. The full Replay dataset consists of 68 scenes of social interactions between people, such as playing board games, exercising, or unwrapping presents. Each scene is about 5 minutes long and filmed with 12 cameras, static and dynamic. Audio is captured separately by 12 binaural microphones and additional near-range microphones for each actor and for each egocentric video. All sensors are temporally synchronized, undistorted, geometrically calibrated, and color calibrated.
This synthetic event dataset is used in Robust e-NeRF to study the collective effect of camera speed profile, contrast threshold variation and refractory period on the quality of NeRF reconstruction from a moving event camera. It was simulated using an improved version of ESIM with three camera configurations of increasing difficulty (easy, medium and hard) on the seven Realistic Synthetic 360° scenes adopted in the synthetic experiments of NeRF, resulting in a total of 21 sequence recordings. Please refer to the Robust e-NeRF paper for more details.
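To make the two varied quantities concrete, here is a toy model of an ideal event camera: an event fires whenever the log-intensity at a pixel deviates from its value at the last event by at least the contrast threshold C, unless the pixel is still within the refractory period. This is a simplified sketch, not the ESIM simulator's implementation.

```python
import numpy as np

def events_from_log_intensity(t, logI, C=0.2, refractory=1e-3):
    """Generate (timestamp, polarity) events for one pixel from sampled
    log-intensity logI at timestamps t, with contrast threshold C and a
    refractory period during which new events are suppressed."""
    events = []
    ref_logI, last_t = logI[0], -np.inf
    for ti, li in zip(t, logI):
        if ti - last_t < refractory:
            continue  # pixel still recovering from the previous event
        if abs(li - ref_logI) >= C:
            events.append((ti, np.sign(li - ref_logI)))
            ref_logI, last_t = li, ti
    return events
```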