The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
10,147 PAPERS • 92 BENCHMARKS
Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5,000 finely annotated images and 20,000 coarsely annotated ones. Data was captured in 50 cities over several months, during the daytime, and in good weather conditions. It was originally recorded as video, so the frames were manually selected to have the following features: a large number of dynamic objects, varying scene layout, and varying background.
3,323 PAPERS • 54 BENCHMARKS
The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in Boston and Singapore. Each scene is 20 seconds long and annotated at 2Hz. This results in a total of 28130 samples for training, 6019 samples for validation and 6008 samples for testing. The dataset has the full autonomous vehicle data suite: 32-beam LiDAR, 6 cameras and radars with complete 360° coverage. The 3D object detection challenge evaluates the performance on 10 classes: cars, trucks, buses, trailers, construction vehicles, pedestrians, motorcycles, bicycles, traffic cones and barriers.
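The sample counts quoted for nuScenes follow roughly from the scene length and the annotation rate. The sketch below is only an arithmetic check on the figures in the description; the variable names are illustrative and are not part of any nuScenes tooling.

```python
# Figures quoted in the nuScenes description above.
NUM_SCENES = 1000
SCENE_SECONDS = 20
ANNOTATION_HZ = 2

# Reported number of annotated samples per split.
SPLIT_SAMPLES = {"train": 28130, "val": 6019, "test": 6008}

# Nominal count if every scene were exactly 20 s long and annotated at 2 Hz.
nominal_samples = NUM_SCENES * SCENE_SECONDS * ANNOTATION_HZ

# Total actually reported across the three splits.
reported_samples = sum(SPLIT_SAMPLES.values())

print(nominal_samples)   # 40000
print(reported_samples)  # 40157
```

The reported total is slightly above the nominal 40,000, which suggests that the 20-second scene length is approximate rather than exact.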
1,549 PAPERS • 20 BENCHMARKS
The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level object and object-part labels. There are 150 semantic categories in total, which include stuff classes such as sky, road, and grass, and discrete objects such as person, car, and bed.
995 PAPERS • 25 BENCHMARKS
The NYU-Depth V2 dataset comprises video sequences from a variety of indoor scenes as recorded by both the RGB and depth cameras of the Microsoft Kinect.
841 PAPERS • 20 BENCHMARKS
LVIS is a dataset for long-tail instance segmentation. It has annotations for over 1,000 object categories in 164K images.
434 PAPERS • 14 BENCHMARKS
Datasets drive vision progress, yet existing driving datasets are impoverished in terms of visual content and supported tasks to study multitask learning for autonomous driving. Researchers are usually constrained to study a small set of problems on one dataset, while real-world computer vision applications require performing tasks of various complexities. We construct BDD100K, the largest driving video dataset with 100K videos and 10 tasks to evaluate the exciting progress of image recognition algorithms on autonomous driving. The dataset possesses geographic, environmental, and weather diversity, which is useful for training models that are less likely to be surprised by new conditions. Based on this diverse dataset, we build a benchmark for heterogeneous multitask learning and study how to solve the tasks together. Our experiments show that special training strategies are needed for existing models to perform such heterogeneous tasks. BDD100K opens the door for future studies in thi
359 PAPERS • 16 BENCHMARKS
KITTI-360 is a large-scale dataset that contains rich sensory information and full annotations. It is the successor of the popular KITTI dataset, providing more comprehensive semantic/instance labels in 2D and 3D, richer 360 degree sensory information (fisheye images and pushbroom laser scans), very accurate and geo-localized vehicle and camera poses, and a series of new challenging benchmarks.
161 PAPERS • 6 BENCHMARKS
YouTubeVIS is a new dataset tailored for tasks like simultaneous detection, segmentation and tracking of object instances in videos and is collected based on the current largest video object segmentation dataset YouTubeVOS.
146 PAPERS • 2 BENCHMARKS
Objects365 is a large-scale object detection dataset with 365 object categories across 600K training images. More than 10 million high-quality bounding boxes are manually labeled through a carefully designed three-step annotation pipeline. It is the largest object detection dataset (with full annotation) so far and establishes a more challenging benchmark for the community.
134 PAPERS • 3 BENCHMARKS
PartNet is a consistent, large-scale dataset of 3D objects annotated with fine-grained, instance-level, and hierarchical 3D part information. The dataset consists of 573,585 part instances over 26,671 3D models covering 24 object categories. This dataset enables and serves as a catalyst for many tasks such as shape analysis, dynamic 3D scene modeling and simulation, affordance analysis, and others.
123 PAPERS • 3 BENCHMARKS
For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding. It contains 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.
59 PAPERS • 1 BENCHMARK
iSAID contains 655,451 object instances for 15 categories across 2,806 high-resolution images. The images in iSAID are the same as those in the DOTA-v1.0 dataset, which are mainly collected from Google Earth; some are taken by the JL-1 satellite, and the others by the GF-2 satellite of the China Centre for Resources Satellite Data and Application.
57 PAPERS • 3 BENCHMARKS
WildDash is a benchmark whose evaluation method uses meta-information to calculate the robustness of a given algorithm with respect to individual hazards.
46 PAPERS • 2 BENCHMARKS
3D-FUTURE (3D FUrniture shape with TextURE) is a 3D dataset that contains 20,240 photo-realistic synthetic images captured in 5,000 diverse scenes, and 9,992 unique industrial 3D CAD shapes of furniture with high-resolution informative textures developed by professional designers.
33 PAPERS • NO BENCHMARKS YET
Our project (STPLS3D) aims to provide a large-scale aerial photogrammetry dataset with synthetic and real annotated 3D point clouds for semantic and instance segmentation tasks.
32 PAPERS • 3 BENCHMARKS
The SIXray dataset is constructed by the Pattern Recognition and Intelligent System Development Laboratory, University of Chinese Academy of Sciences. It contains 1,059,231 X-ray images collected from several subway stations. There are six common categories of prohibited items, namely gun, knife, wrench, pliers, scissors and hammer. It has three subsets called SIXray10, SIXray100 and SIXray1000. Image-level annotations provided by human security inspectors are available for the whole dataset. In addition, the images in the test set are annotated with a bounding box for each prohibited item to evaluate the performance of object localization.
29 PAPERS • 1 BENCHMARK
The PASCAL-Scribble Dataset is an extension of the PASCAL dataset with scribble annotations for semantic segmentation. The annotations follow two different protocols. In the first protocol, the PASCAL VOC 2012 set is annotated, with 20 object categories (aeroplane, bicycle, ...) and one background category. There are 12,031 images annotated, including 10,582 images in the training set and 1,449 images in the validation set. In the second protocol, the 59 object/stuff categories and one background category involved in the PASCAL-CONTEXT dataset are used. Besides the 20 object categories in the first protocol, there are 39 extra categories (snow, tree, ...) included. This protocol is followed to annotate the PASCAL-CONTEXT dataset. 4,998 images in the training set have been annotated.
25 PAPERS • NO BENCHMARKS YET
UVO is a new benchmark for open-world class-agnostic object segmentation in videos. Besides shifting the problem focus to the open-world setup, UVO is significantly larger, providing approximately 8 times more videos compared with DAVIS, and 7 times more mask (instance) annotations per video compared with YouTube-VOS and YouTube-VIS. UVO is also more challenging as it includes many videos with crowded scenes and complex background motions.
23 PAPERS • 3 BENCHMARKS
Developing robot perception systems for handling objects in the real-world requires computer vision algorithms to be carefully scrutinized with respect to the expected operating domain. This demands large quantities of ground truth data to rigorously evaluate the performance of algorithms.
22 PAPERS • 1 BENCHMARK
The LIVECell (Label-free In Vitro image Examples of Cells) dataset is a large-scale microscopic image dataset for instance-segmentation of individual cells in 2D cell cultures.
14 PAPERS • 1 BENCHMARK
Fashionpedia consists of two parts: (1) an ontology built by fashion experts containing 27 main apparel categories, 19 apparel parts, 294 fine-grained attributes and their relationships; (2) a dataset with everyday and celebrity event fashion images annotated with segmentation masks and their associated per-mask fine-grained attributes, built upon the Fashionpedia ontology.
13 PAPERS • NO BENCHMARKS YET
The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems across multiple image prediction tasks, concepts, and data sources. GRIT hopes to encourage our research community to pursue the following research directions:
13 PAPERS • 8 BENCHMARKS
CryoNuSeg is a fully annotated FS-derived cryosectioned and H&E-stained nuclei instance segmentation dataset. The dataset contains images from 10 human organs that were not exploited in other publicly available datasets, and is provided with three manual mark-ups to allow measuring intra-observer and inter-observer variability.
11 PAPERS • NO BENCHMARKS YET
TACO is a growing image dataset of waste in the wild. It contains images of litter taken under diverse environments: woods, roads and beaches. These images are manually labelled and segmented according to a hierarchical taxonomy to train and evaluate object detection algorithms. The annotations are provided in COCO format.
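Because the annotations use the COCO format, they can be inspected with a plain JSON parser. The sketch below assumes the standard COCO top-level keys (`images`, `categories`, `annotations`) and counts instances per category. The inline data, file name, and category names are made up for illustration and are not taken from TACO.

```python
import json
from collections import Counter

# A tiny inline stand-in for a COCO-format annotation file (hypothetical data).
COCO_JSON = """
{
  "images": [{"id": 1, "file_name": "img_0001.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "bottle"}, {"id": 2, "name": "can"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 50, 80],
     "area": 4000, "iscrowd": 0,
     "segmentation": [[10, 20, 60, 20, 60, 100, 10, 100]]},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [200, 40, 30, 60],
     "area": 1800, "iscrowd": 0,
     "segmentation": [[200, 40, 230, 40, 230, 100, 200, 100]]},
    {"id": 12, "image_id": 1, "category_id": 2, "bbox": [300, 300, 40, 40],
     "area": 1600, "iscrowd": 0,
     "segmentation": [[300, 300, 340, 300, 340, 340, 300, 340]]}
  ]
}
"""


def count_instances_per_category(coco: dict) -> dict:
    """Map each category name to the number of annotated instances."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])
    return dict(counts)


coco = json.loads(COCO_JSON)
instance_counts = count_instances_per_category(coco)
print(instance_counts)  # {'bottle': 2, 'can': 1}
```

For a real annotation file, the inline string would be replaced by reading the JSON from disk; the per-annotation `bbox` is `[x, y, width, height]` in pixels.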
9 PAPERS • NO BENCHMARKS YET
The AIRS (Aerial Imagery for Roof Segmentation) dataset provides a wide coverage of aerial imagery with 7.5 cm resolution and contains over 220,000 buildings. The task posed for AIRS is defined as roof segmentation.
8 PAPERS • 1 BENCHMARK
The TrashCan dataset is an instance-segmentation dataset of underwater trash. It comprises 7,212 annotated images which contain observations of trash, ROVs, and a wide variety of undersea flora and fauna. The annotations in this dataset take the format of instance segmentation annotations: bitmaps containing a mask marking which pixels in the image contain each object. The imagery in TrashCan is sourced from the J-EDI (JAMSTEC E-Library of Deep-sea Images) dataset, curated by the Japan Agency for Marine-Earth Science and Technology (JAMSTEC).
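To make the idea of a per-instance bitmap mask concrete, the sketch below derives a pixel area and a bounding box from one binary mask. The nested-list layout and function names are assumptions chosen for illustration; they do not describe TrashCan's actual file format.

```python
from typing import List, Optional, Tuple

# Rows of 0/1 values; 1 marks a pixel that belongs to the object (assumed layout).
Mask = List[List[int]]


def mask_area(mask: Mask) -> int:
    """Number of pixels covered by the object."""
    return sum(sum(row) for row in mask)


def mask_bbox(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, width, height), or None for an empty mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    return (x_min, y_min, x_max - x_min + 1, y_max - y_min + 1)


example_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

print(mask_area(example_mask))  # 5
print(mask_bbox(example_mask))  # (1, 1, 3, 2)
```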
8 PAPERS • NO BENCHMARKS YET
A multi-task 4D radar-camera fusion dataset for autonomous driving on water surfaces.
8 PAPERS • 2 BENCHMARKS
Hi4D contains 4D textured scans of 20 subject pairs, 100 sequences, and a total of more than 11K frames. Hi4D contains rich interaction-centric annotations in 2D and 3D alongside accurately registered parametric body models.
7 PAPERS • NO BENCHMARKS YET
SpaceNet 2: Building Detection v2 is a dataset for building footprint detection in geographically diverse settings from very high resolution satellite images. It contains over 302,701 building footprints and 3/8-band WorldView-3 satellite imagery at 0.3 m pixel resolution across 5 cities (Rio de Janeiro, Las Vegas, Paris, Shanghai, Khartoum), and covers areas that are both urban and suburban in nature. The dataset was split 60%/20%/20% for train/test/validation.
7 PAPERS • 1 BENCHMARK
TTPLA is a public dataset of aerial images of Transmission Towers (TTs) and Power Lines (PLs). It can be used for detection and segmentation of transmission towers and power lines. It consists of 1,100 images with a resolution of 3,840×2,160 pixels, as well as 8,987 manually labelled instances of TTs and PLs.
Augments the KITTI dataset with additional instance-level pixel annotations for 8 categories.
5 PAPERS • 1 BENCHMARK
Satlas is a remote sensing dataset and benchmark that is large in both breadth, covering a wide range of remote sensing applications, and scale, comprising 290M labels under 137 categories and 7 label modalities.
5 PAPERS • NO BENCHMARKS YET
Embrapa Wine Grape Instance Segmentation Dataset (WGISD) contains grape clusters properly annotated in 300 images and a novel annotation methodology for segmentation of complex objects in natural images.
Video sequences from a glasshouse environment in Campus Kleinaltendorf (CKA), University of Bonn, captured by PATHoBot, a glasshouse monitoring robot.
4 PAPERS • NO BENCHMARKS YET
The CropAndWeed dataset is focused on the fine-grained identification of 74 relevant crop and weed species with a strong emphasis on data variability. Annotations of labeled bounding boxes, semantic masks and stem positions are provided for about 112k instances in more than 8k high-resolution images of both real-world agricultural sites and specifically cultivated outdoor plots of rare weed types. Additionally, each sample is enriched with meta-annotations regarding environmental conditions.
The Fraunhofer IPA Bin-Picking dataset is a large-scale dataset comprising both simulated and real-world scenes for various objects (potentially having symmetries) and is fully annotated with 6D poses. A physics simulation is used to create scenes of many parts in bulk by dropping objects in a random position and orientation above a bin. Additionally, this dataset extends the Siléane dataset by providing more samples. This makes it possible, for example, to train deep neural networks and benchmark their performance on the public Siléane dataset.
Northumberland Dolphin Dataset 2020 (NDD20) is a challenging image dataset annotated for both coarse and fine-grained instance segmentation and categorisation. This dataset, the first release of the NDD, was created in response to the rapid expansion of computer vision into conservation research and the production of field-deployable systems suited to extreme environmental conditions -- an area with few open source datasets. NDD20 contains a large collection of above and below water images of two different dolphin species for traditional coarse and fine-grained segmentation.
SpaceNet 1: Building Detection v1 is a dataset for building footprint detection. The data comprises 382,534 building footprints, covering an area of 2,544 sq. km of 3/8-band WorldView-2 imagery (0.5 m pixel res.) across the city of Rio de Janeiro, Brazil. The images are processed as 200m×200m tiles with associated building footprint vectors for training.
4 PAPERS • 2 BENCHMARKS
The Aircraft Context Dataset, a composition of two inter-compatible large-scale and versatile image datasets focusing on manned aircraft and UAVs, is intended for training and evaluating classification, detection and segmentation models in aerial domains. Additionally, a set of relevant meta-parameters can be used to quantify dataset variability as well as the impact of environmental conditions on model performance.
3 PAPERS • NO BENCHMARKS YET
DeepSportradar is a benchmark suite of computer vision tasks, datasets and benchmarks for automated sport understanding. DeepSportradar currently supports four challenging tasks related to basketball: ball 3D localization, camera calibration, player instance segmentation and player re-identification. For each of the four tasks, a detailed description of the dataset, objective, performance metrics, and the proposed baseline method are provided.
Consists of user-generated aerial videos from social media with annotations of instance-level building damage masks. This provides the first benchmark for quantitative evaluation of models to assess building damage using aerial videos.
We present a large-scale dataset for 3D urban scene understanding. Compared to existing datasets, our dataset consists of 75 outdoor urban scenes with diverse backgrounds, encompassing over 15,000 images. These scenes offer 360° hemispherical views, capturing diverse foreground objects illuminated under various lighting conditions. Additionally, our dataset encompasses scenes that are not limited to forward-driving views, addressing the limitations of previous datasets such as limited overlap and coverage between camera views. The closest pre-existing dataset for generalizable evaluation is DTU [2] (80 scenes) which comprises mostly indoor objects and does not provide multiple foreground objects or background scenes.
3 PAPERS • 1 BENCHMARK
Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. A subset of 1.9M images includes diverse annotation types.
Real-world dataset of ~400 images of cuboid-shaped parcels with full 2D and 3D annotations in the COCO format.
RobotPush is a dataset for object singulation, the task of separating cluttered objects through physical interaction. The dataset contains 3,456 training images with labels and 1,024 validation images with labels. It consists of simulated and real-world data collected from a PR2 robot equipped with a Kinect 2 camera. The dataset also contains ground truth instance segmentation masks for 110 images in the test set.
Separated COCO is an automatically generated subset of the COCO val dataset. It collects separated objects for a large variety of categories in real images in a scalable manner, where the target object's segmentation mask is separated into distinct regions by the occluder.
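One way to picture a mask that is "separated into distinct regions" is to count the connected components of a binary mask: more than one component means the visible object is split. The sketch below uses a 4-connected breadth-first search on a nested-list mask. It is an illustrative assumption, not the procedure used to build Separated COCO, and the function names are hypothetical.

```python
from collections import deque
from typing import List


def count_regions(mask: List[List[int]]) -> int:
    """Count 4-connected regions of 1-pixels in a binary mask."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions = 0
    for y in range(height):
        for x in range(width):
            if mask[y][x] and not seen[y][x]:
                # Found an unvisited object pixel: flood-fill its whole region.
                regions += 1
                seen[y][x] = True
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return regions


def is_separated(mask: List[List[int]]) -> bool:
    """True when the mask consists of more than one disconnected region."""
    return count_regions(mask) > 1


# The middle column of zeros stands in for an occluder splitting the object.
split_mask = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
whole_mask = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]

print(count_regions(split_mask), is_separated(split_mask))  # 2 True
print(count_regions(whole_mask), is_separated(whole_mask))  # 1 False
```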
Synthetic training dataset of 50,000 depth images and 320,000 object masks using simulated heaps of 3D CAD models.