The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
10,147 PAPERS • 92 BENCHMARKS
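COCO annotations are distributed as JSON files that are typically read through the pycocotools API. Below is a minimal sketch of loading instance annotations for one image; the annotation path is a placeholder and should be adjusted to the local copy of the dataset.

```python
# Minimal sketch: reading COCO instance annotations with pycocotools
# (pip install pycocotools). The JSON path is a placeholder.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")
img_id = coco.getImgIds()[0]             # first image id in the split
ann_ids = coco.getAnnIds(imgIds=img_id)  # annotation ids for that image
for ann in coco.loadAnns(ann_ids):
    cat = coco.loadCats(ann["category_id"])[0]["name"]
    print(cat, ann["bbox"])              # bbox is [x, y, width, height]
```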
KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their needs. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 (140 for training and 112 for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images from the visual odometry challenge.
3,219 PAPERS • 141 BENCHMARKS
PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: person; bird, cat, cow, dog, horse, sheep; aeroplane, bicycle, boat, bus, car, motorbike, train; bottle, chair, dining table, potted plant, sofa, and tv/monitor.
119 PAPERS • 14 BENCHMARKS
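VOC annotations are stored as one XML file per image. The sketch below, using only the Python standard library, shows how a single annotation file can be parsed; the file name is a placeholder.

```python
# Minimal sketch: parsing one PASCAL VOC XML annotation file.
import xml.etree.ElementTree as ET

root = ET.parse("000001.xml").getroot()  # placeholder file name
for obj in root.iter("object"):
    name = obj.findtext("name")          # one of the twenty classes
    box = obj.find("bndbox")
    xmin, ymin = int(box.findtext("xmin")), int(box.findtext("ymin"))
    xmax, ymax = int(box.findtext("xmax")), int(box.findtext("ymax"))
    print(name, (xmin, ymin, xmax, ymax))
```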
The Mall is a dataset for crowd counting and profiling research. Its images are collected from a publicly accessible webcam. It includes 2,000 video frames, and the head position of every pedestrian in all frames is annotated. In total, more than 60,000 pedestrians are annotated in this dataset.
63 PAPERS • 1 BENCHMARK
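Head-position annotations like Mall's are commonly converted into density maps for training crowd counters: a unit impulse is placed at each head position and blurred with a Gaussian, so the map still integrates to the pedestrian count. A minimal sketch follows; the coordinates are made up, and the 480 x 640 frame size is an assumption.

```python
# Minimal sketch: head annotations -> density map for crowd counting.
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(head_points, shape=(480, 640), sigma=4.0):
    dm = np.zeros(shape, dtype=np.float32)
    for x, y in head_points:                 # (column, row) pixel coords
        dm[int(y), int(x)] += 1.0            # one unit of mass per person
    return gaussian_filter(dm, sigma=sigma)  # blur roughly preserves the sum

dm = density_map([(120, 80), (300, 200), (305, 210)])
print(dm.sum())  # ~3.0, i.e. the number of annotated pedestrians
```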
Honda Research Institute Driving Dataset (HDD) is a dataset to enable research on learning driver behavior in real-life environments. The dataset includes 104 hours of real human driving in the San Francisco Bay Area collected using an instrumented vehicle equipped with different sensors.
37 PAPERS • NO BENCHMARKS YET
ScanRefer contains 51,583 descriptions of 11,046 objects from 800 ScanNet scenes. It is the first large-scale effort to perform object localization via natural language expression directly in 3D.
The SIXray dataset is constructed by the Pattern Recognition and Intelligent System Development Laboratory, University of Chinese Academy of Sciences. It contains 1,059,231 X-ray images collected from several subway stations. There are six common categories of prohibited items: gun, knife, wrench, pliers, scissors, and hammer. It has three subsets, called SIXray10, SIXray100, and SIXray1000. Image-level annotations are provided by human security inspectors for the whole dataset. In addition, the images in the test set are annotated with a bounding box for each prohibited item to evaluate the performance of object localization.
29 PAPERS • 1 BENCHMARK
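Since the image-level annotations mark which of the six categories appear in each image, training on SIXray is naturally framed as multi-label classification. Below is a minimal sketch of encoding such labels as multi-hot targets; it is illustrative and not taken from the official tooling.

```python
# Minimal sketch: multi-hot targets for SIXray's six prohibited-item classes.
import numpy as np

CATEGORIES = ["gun", "knife", "wrench", "pliers", "scissors", "hammer"]

def to_multi_hot(present_items):
    """Map the category names present in an image to a 6-dim target."""
    target = np.zeros(len(CATEGORIES), dtype=np.float32)
    for name in present_items:
        target[CATEGORIES.index(name)] = 1.0
    return target

print(to_multi_hot(["gun", "scissors"]))  # [1. 0. 0. 0. 1. 0.]
```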
UVO is a new benchmark for open-world class-agnostic object segmentation in videos. Besides shifting the problem focus to the open-world setup, UVO is significantly larger, providing approximately 8 times more videos than DAVIS and 7 times more mask (instance) annotations per video than YouTube-VOS and YouTube-VIS. UVO is also more challenging, as it includes many videos with crowded scenes and complex background motions.
23 PAPERS • 3 BENCHMARKS
The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for measuring the performance and robustness of vision systems across multiple image prediction tasks, concepts, and data sources. GRIT aims to encourage the research community to build vision systems that generalize and remain robust across tasks, concepts, and domains.
13 PAPERS • 8 BENCHMARKS
The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set of datasets, e.g. Microsoft COCO and Pascal VOC. Due to image retrieval and annotation costs, these datasets consist largely of images found on the web and do not represent many real-life domains that are modelled in practice, e.g. satellite, microscopy, and gaming imagery, making it difficult to assess the degree of generalization learned by the model.
4 PAPERS • 1 BENCHMARK
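At the core of mAP-style detection metrics is the intersection-over-union (IoU) test that matches predictions to ground truth. A minimal sketch, with boxes given as (x1, y1, x2, y2) corner coordinates:

```python
# Minimal sketch: IoU between two axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.1428...
```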
FractureAtlas is a musculoskeletal bone fracture dataset with annotations for deep learning tasks such as classification, localization, and segmentation. It contains a total of 4,083 X-ray images with annotations in COCO, VGG, YOLO, and Pascal VOC formats. The dataset is made freely available under a CC-BY 4.0 license: the data are free to copy, share, and redistribute in any medium or format, and may be adapted, remixed, transformed, and built upon. Note that correctly interpreting the dataset requires knowledge of medicine and radiology, and the possibility of labeling errors should be taken into account.
1 PAPER • NO BENCHMARKS YET
The HRPlanesv2 dataset contains 2,120 VHR Google Earth images. To further improve experimental results, images of airports from many different regions with various uses (civil/military/joint) were selected and labeled. A total of 14,335 aircraft have been labelled. Each image is stored as a ".jpg" file of size 4800 x 2703 pixels, and each label is stored in YOLO ".txt" format. The dataset has been split into three parts: 70% training and 20% validation, with the remainder used for testing. Each aircraft in the training and validation images is at least 80% visible. Link: https://github.com/dilsadunsal/HRPlanesv2-Data-Set
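YOLO ".txt" labels store one object per line as a class index followed by the box center, width, and height, all normalized to [0, 1]. Below is a minimal sketch of converting such a line to pixel-space corners for HRPlanesv2's 4800 x 2703 images; the example label line is made up.

```python
# Minimal sketch: YOLO-format label line -> pixel corner coordinates.
W, H = 4800, 2703  # HRPlanesv2 image size

def yolo_to_corners(line):
    cls, xc, yc, w, h = line.split()
    xc, yc = float(xc) * W, float(yc) * H   # de-normalize the center
    w, h = float(w) * W, float(h) * H       # de-normalize the size
    return int(cls), (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)

print(yolo_to_corners("0 0.5 0.5 0.1 0.2"))
# (0, (2160.0, 1081.2, 2640.0, 1621.8))
```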
IllusionVQA is a Visual Question Answering (VQA) dataset with two sub-tasks. The first task tests comprehension on 435 instances across 12 optical illusion categories. Each instance consists of an image with an optical illusion, a question, and 3 to 6 options, one of which is the correct answer. We refer to this task as IllusionVQA-Comprehension. The second task tests how well VLMs can differentiate geometrically impossible objects from ordinary objects when two objects are presented side by side. It consists of 1,000 instances following a similar format to the first task. We refer to this task as IllusionVQA-Soft-Localization.
1 PAPER • 2 BENCHMARKS
SolarDK is a dataset for the detection and localization of solar panels. It comprises images from GeoDanmark with a variable Ground Sample Distance (GSD) between 10 cm and 15 cm, all sampled between March 1st and May 1st 2021. It contains 23,417 hand-labelled images for classification and 880 segmentation masks, in addition to a set of more than 100,000 images for classification covering most variations of Danish urban and rural landscapes.
A dataset built from partial reconstructions of real-world indoor scenes using RGB-D sequences from ScanNet, aimed at estimating the unknown position of an object (e.g. where is the bag?) given a partial 3D scan of a scene. The dataset consists mostly of bedrooms, bathrooms, and living rooms; some room types, such as closet and gym, have only a few instances.
The VGG Cell dataset (made up entirely of synthetic images) is the main public benchmark used to compare cell counting techniques.
Infinity AI's Spills Basic Dataset is a synthetic, open-source dataset for safety applications. It features 150 videos of photorealistic liquid spills across 15 common settings. Spills take on in-context reflections, caustics, and depth based on the surrounding environment, lighting, and floor. Each video contains a spill of unique properties (size, color, profile, and more) and is accompanied by pixel-perfect labels and annotations. This dataset can be used to develop computer vision algorithms to detect the location and type of spill from the perspective of a fixed camera.
0 PAPERS • NO BENCHMARKS YET