Datasets drive vision progress, yet existing driving datasets are impoverished in terms of visual content and supported tasks to study multitask learning for autonomous driving. Researchers are usually constrained to study a small set of problems on one dataset, while real-world computer vision applications require performing tasks of various complexities. We construct BDD100K, the largest driving video dataset with 100K videos and 10 tasks to evaluate the exciting progress of image recognition algorithms on autonomous driving. The dataset possesses geographic, environmental, and weather diversity, which is useful for training models that are less likely to be surprised by new conditions. Based on this diverse dataset, we build a benchmark for heterogeneous multitask learning and study how to solve the tasks together. Our experiments show that special training strategies are needed for existing models to perform such heterogeneous tasks. BDD100K opens the door for future studies in this direction.
359 PAPERS • 16 BENCHMARKS
VisDrone is a large-scale benchmark with carefully annotated ground-truth for various important computer vision tasks, to make vision meet drones. The VisDrone2019 dataset is collected by the AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University, China. The benchmark dataset consists of 288 video clips formed by 261,908 frames and 10,209 static images, captured by various drone-mounted cameras, covering a wide range of aspects including location (taken from 14 different cities separated by thousands of kilometers in China), environment (urban and country), objects (pedestrians, vehicles, bicycles, etc.), and density (sparse and crowded scenes). Note that the dataset was collected using various drone platforms (i.e., drones with different models), in different scenarios, and under various weather and lighting conditions. These frames are manually annotated with more than 2.6 million bounding boxes of targets of frequent interest, such as pedestrians, cars, bicycles, and tricycles.
61 PAPERS • 2 BENCHMARKS
For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding. It contains 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.
59 PAPERS • 1 BENCHMARK
COCO-O(ut-of-distribution) contains 6 domains (sketch, cartoon, painting, weather, handmake, tattoo) of COCO objects which are hard for most existing detectors to detect. The dataset has a total of 6,782 images and 26,624 labelled bounding boxes.
41 PAPERS • 1 BENCHMARK
The Exclusively Dark (ExDark) dataset is a collection of 7,363 low-light images captured in conditions ranging from very low-light environments to twilight (i.e., 10 different conditions), with 12 object classes (similar to PASCAL VOC) annotated at both the image class level and with local object bounding boxes.
We introduce an object detection dataset in challenging adverse weather conditions covering 12,000 samples in real-world driving scenes and 1,500 samples in controlled weather conditions within a fog chamber. The dataset includes different weather conditions such as fog, snow, and rain, and was acquired over more than 10,000 km of driving in northern Europe. In total, 100k objects were labeled with accurate 2D and 3D bounding boxes. The main contributions of this dataset are:
- We provide a proving ground for a broad range of algorithms covering signal enhancement, domain adaptation, object detection, and multi-modal sensor fusion, focusing on the learning of robust redundancies between sensors, especially if they fail asymmetrically in different weather conditions.
- The dataset was created with the initial intention of showcasing methods that learn robust redundancies between sensors and enable raw-data sensor fusion in case of sensor failure.
30 PAPERS • 2 BENCHMARKS
A benchmark for logo detection, evaluated as an object detection task.
26 PAPERS • 3 BENCHMARKS
The KITTI Multi-Object Tracking and Segmentation (MOTS) benchmark [2] consists of 21 training sequences and 29 test sequences. It is based on the KITTI Tracking Evaluation 2012 and extends the annotations to the multi-object tracking and segmentation (MOTS) task by adding dense pixel-wise segmentation labels for every object. Submitted results are evaluated using the HOTA, CLEAR MOT, and MT/PT/ML metrics, and methods are ranked by HOTA [1]. The development kit and GitHub evaluation code provide details about the data format as well as utility functions for reading and writing the label files (adapted for the segmentation case). Evaluation is performed using the code from the TrackEval repository.
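As a rough illustration of working with MOTS-style labels, the sketch below decodes one annotation line into a binary mask with pycocotools. The assumed line layout (frame, object id, class id, image height, image width, COCO-style RLE) follows the common KITTI MOTS txt export; consult the official development kit for the authoritative format.

```python
# Minimal sketch: decode one line of a MOTS-style .txt label file into a binary mask.
# Assumes the layout "frame obj_id class_id img_height img_width rle" with a
# COCO-style run-length encoding; verify against the official development kit.
from pycocotools import mask as rletools

def parse_mots_line(line: str):
    frame, obj_id, class_id, height, width, rle = line.strip().split(" ", 5)
    mask = rletools.decode({"size": [int(height), int(width)],
                            "counts": rle.encode("utf-8")})
    return int(frame), int(obj_id), int(class_id), mask  # mask is an HxW uint8 array

# Usage (placeholder RLE string):
# frame_idx, obj_id, class_id, mask = parse_mots_line("0 1001 1 375 1242 <rle-string>")
```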
26 PAPERS • 1 BENCHMARK
RADIATE (RAdar Dataset In Adverse weaThEr) is a new automotive dataset created by Heriot-Watt University which includes radar, lidar, stereo camera, and GPS/IMU data. The data is collected in different weather scenarios (sunny, overcast, night, fog, rain, and snow) to help the research community develop new methods of vehicle perception. The radar images are annotated in 7 different scenarios: Sunny (Parked), Sunny/Overcast (Urban), Overcast (Motorway), Night (Motorway), Rain (Suburban), Fog (Suburban), and Snow (Suburban). The dataset contains 8 different types of objects (car, van, truck, bus, motorbike, bicycle, pedestrian, and group of pedestrians).
19 PAPERS • 2 BENCHMARKS
Parts and Attributes of Common Objects (PACO) is a detection dataset that goes beyond traditional object boxes and masks and provides richer annotations such as part masks and attributes. It spans 75 object categories, 456 object-part categories and 55 attributes across image (LVIS) and video (Ego4D) datasets. The dataset contains 641K part masks annotated across 260K object boxes, with half of them exhaustively annotated with attributes as well.
16 PAPERS • NO BENCHMARKS YET
15 PAPERS • 2 BENCHMARKS
Prophesee’s GEN1 Automotive Detection Dataset is the largest event-based dataset to date.
10 PAPERS • 1 BENCHMARK
The SARDet-100K dataset encompasses a total of 116,598 images and 245,653 instances distributed across six categories: Aircraft, Ship, Car, Bridge, Tank, and Harbor. It stands as the first large-scale SAR object detection dataset, comparable in size to the widely used COCO dataset (118K images). The scale and diversity of SARDet-100K provide researchers with a robust basis for training and evaluation, advancing SAR object detection algorithms and techniques and fostering the development of state-of-the-art models in this domain.
9 PAPERS • 1 BENCHMARK
CeyMo is a novel benchmark dataset for road marking detection which covers a wide variety of challenging urban, sub-urban, and rural road scenarios. The dataset consists of 2,887 images of 1920×1080 resolution with 4,706 road marking instances belonging to 11 classes. The test set is divided into six categories: normal, crowded, dazzle light, night, rain, and shadow.
6 PAPERS • 1 BENCHMARK
With the advance of AI, road object detection has become a prominent topic in computer vision, mostly using perspective cameras. A fisheye lens provides wide omnidirectional coverage, allowing fewer cameras to monitor road intersections, albeit with view distortions. The FishEye8K dataset will be available on GitHub (https://github.com/MoyoG/FishEye8K) with PASCAL VOC, MS COCO, and YOLO annotation formats.
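Because the same boxes are shipped in PASCAL VOC (absolute corner coordinates) and YOLO (normalized center/size) conventions, a small conversion helper makes the relationship concrete. This is an illustrative sketch; the function name and defaults are not part of the dataset's tooling.

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a PASCAL VOC box (absolute corner coords) to YOLO format
    (normalized center x/y, width, height). Purely illustrative helper."""
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return cx, cy, w, h

# Example: a 100x50 px box with top-left corner at (200, 300) in a 1920x1080 frame.
print(voc_to_yolo(200, 300, 300, 350, 1920, 1080))
```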
5 PAPERS • 1 BENCHMARK
FES is an indoor dataset that can be used for evaluation of deep learning approaches. It consists of 301 top-view fisheye images from an indoor scene. Annotations include bounding boxes and instance segmentation masks for 6 classes.
4 PAPERS • NO BENCHMARKS YET
The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set of datasets, e.g. Microsoft COCO and Pascal VOC. Due to image retrieval and annotation costs, these datasets consist largely of images found on the web and do not represent many real-life domains that are being modelled in practice, e.g. satellite, microscopic, and gaming imagery, making it difficult to assess the degree of generalization learned by the model.
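As a concrete picture of that single-metric evaluation, here is a minimal COCO-style mAP computation using torchmetrics; the boxes and labels are placeholder tensors standing in for detector outputs and ground truth.

```python
# Minimal sketch of COCO-style mAP evaluation, the "single metric" referred to above.
# The boxes/labels below are placeholders; in practice they come from a detector and
# a dataset's ground-truth annotations.
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy")

preds = [{
    "boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
    "scores": torch.tensor([0.87]),
    "labels": torch.tensor([3]),
}]
targets = [{
    "boxes": torch.tensor([[55.0, 65.0, 195.0, 215.0]]),
    "labels": torch.tensor([3]),
}]

metric.update(preds, targets)
print(metric.compute()["map"])  # mean AP averaged over IoU thresholds 0.50:0.95
```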
4 PAPERS • 1 BENCHMARK
VEDAI is a dataset for Vehicle Detection in Aerial Imagery, provided as a tool to benchmark automatic target recognition algorithms in unconstrained environments. The vehicles contained in the database, in addition to being small, exhibit different variabilities such as multiple orientations, lighting/shadowing changes, specularities, or occlusions. Furthermore, each image is available in several spectral bands and resolutions. A precise experimental protocol is also given, ensuring that the experimental results obtained by different people can be properly reproduced and compared. We also give the performance of some baseline algorithms on this dataset, for different settings of these algorithms, to illustrate the difficulties of the task and provide baseline comparisons.
Vehicle-to-Everything (V2X) networking has enabled collaborative perception in autonomous driving, a promising solution to the fundamental limitations of stand-alone intelligence, including blind zones and limited long-range perception. However, the lack of datasets has severely hindered the development of collaborative perception algorithms. In this work, we release DOLPHINS: Dataset for cOllaborative Perception enabled Harmonious and INterconnected Self-driving, a new large-scale simulated autonomous driving dataset with varied scenarios, multiple viewpoints, and multiple modalities, which provides a ground-breaking benchmark platform for interconnected autonomous driving. DOLPHINS outperforms current datasets in six dimensions, including temporally aligned images and point clouds from both vehicles and Road Side Units (RSUs), enabling both Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) collaborative perception, and 6 typical scenarios with dynamic weather conditions that make it one of the most varied interconnected autonomous driving datasets.
3 PAPERS • NO BENCHMARKS YET
Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. A subset of 1.9M images includes diverse annotation types.
Throughout the history of art, the pose, as the holistic abstraction of the human body's expression, has proven to be a constant in numerous studies. However, due to the enormous amount of data that so far had to be processed by hand, its crucial role in the formulaic recapitulation of art-historical motifs since antiquity could only be highlighted selectively. This is true even for the now automated estimation of human poses, as domain-specific, sufficiently large datasets required for training computational models are either not publicly available or not indexed at a fine enough granularity. With the Poses of People in Art dataset, we introduce the first openly licensed dataset for estimating human poses in art and validating human pose estimators. It consists of 2,454 images from 22 art-historical depiction styles, including those that have increasingly turned away from lifelike representations of the body since the 19th century. A total of 10,749 human figures are precisely enclosed in bounding boxes.
3 PAPERS • 1 BENCHMARK
A brain tumor is considered one of the most aggressive diseases among children and adults. Brain tumors account for 85 to 90 percent of all primary Central Nervous System (CNS) tumors. Every year, around 11,700 people are diagnosed with a brain tumor. The 5-year survival rate for people with a cancerous brain or CNS tumor is approximately 34 percent for men and 36 percent for women. Brain tumors are classified as benign tumors, malignant tumors, pituitary tumors, etc. Proper treatment, planning, and accurate diagnostics should be implemented to improve the life expectancy of patients. The best technique to detect brain tumors is Magnetic Resonance Imaging (MRI). A huge amount of image data is generated through the scans, and these images are examined by the radiologist. Manual examination can be error-prone due to the level of complexity involved in brain tumors and their properties. The application of automated classification techniques using Machine Learning (ML) and Artificial Intelligence (AI) has consistently shown higher accuracy than manual classification.
2 PAPERS • NO BENCHMARKS YET
Synthetic dataset of over 13,000 images of damaged and intact parcels with full 2D and 3D annotations in the COCO format. For details see our paper and for visual samples our project page.
SODA-D is a large-scale dataset tailored for small object detection in driving scenarios. It is built on top of the MVD dataset and data collected by the authors, where the former is dedicated to pixel-level understanding of street scenes and the latter is mainly captured by onboard cameras and mobile phones. With 24,704 well-chosen, high-quality images of driving scenarios, SODA-D comprises 277,596 instances of 9 categories annotated with horizontal bounding boxes.
2 PAPERS • 1 BENCHMARK
The Thermal Bridges on Building Rooftops (TBBR) dataset consists of annotated combined RGB and thermal drone images with a height map. All images were converted to a uniform format of 3000×4000 pixels, aligned, and cropped to 2400×3400 pixels to remove empty borders.
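A minimal sketch of that normalization step, using Pillow, is shown below; whether the original TBBR pipeline cropped centrally is an assumption made only for illustration.

```python
# Minimal sketch of the normalization step described above: bring an image to
# 3000x4000 pixels and crop to 2400x3400. Assumes a centred crop, which may differ
# from the exact procedure used to build TBBR.
from PIL import Image

def normalize_tbbr(path: str) -> Image.Image:
    img = Image.open(path).resize((4000, 3000))          # PIL uses (width, height)
    left, top = (4000 - 3400) // 2, (3000 - 2400) // 2   # centre the 3400x2400 crop
    return img.crop((left, top, left + 3400, top + 2400))

# cropped = normalize_tbbr("rooftop_rgb_0001.jpg")  # hypothetical filename
```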
2 PAPERS • 2 BENCHMARKS
This dataset contains 9 different seafood types collected from a supermarket in Izmir, Turkey for a university-industry collaboration project at Izmir University of Economics, and this work was published in ASYU 2020. The dataset includes gilt head bream, red sea bream, sea bass, red mullet, horse mackerel, black sea sprat, striped red mullet, trout, shrimp image samples.
1 PAPER • NO BENCHMARKS YET
This dataset contains a collection of 131 X-ray CT scans of pieces of modeling clay (Play-Doh) with various numbers of stones inserted, retrieved in the FleX-ray lab at CWI. The dataset consists of 5 parts. It is intended as raw supplementary material to reproduce the CT reconstructions and subsequent results in the paper titled "A tomographic workflow enabling deep learning for X-ray based foreign object detection". The dataset can be used to set up other CT-based experiments concerning similar objects with variations in shape and composition.
This dataset contains a collection of 235800 X-ray projections of 131 pieces of modeling clay (Play-Doh) with various numbers of stones inserted. The dataset is intended as an extensive and easy-to-use training dataset for supervised machine learning driven object detection. The ground truth locations of the stones are included.
This is a dataset of blood cell photos.
This dataset contains both artificial and real images of bramble flowers. The real images were taken with a RealSense D435 camera inside the West Virginia University greenhouse. All the flowers are annotated in YOLO format with a bounding box and class name. The trained weights are also provided and can be used with the included Python script to detect bramble flowers. The classifier can also determine whether a flower's center is visible or hidden, which is helpful for precision pollination projects. Images are augmented to make the task robust to various environmental conditions.
The CLCXray dataset contains 9,565 X-ray images, in which 4,543 X-ray images (real data) are obtained from a real subway scene and 5,022 X-ray images (simulated data) are scanned from manually designed baggage. There are 12 categories in the CLCXray dataset, including 5 types of cutters and 7 types of liquid containers. The five kinds of cutters are blade, dagger, knife, scissors, and Swiss Army knife. The seven kinds of liquid containers are cans, carton drinks, glass bottles, plastic bottles, vacuum cups, spray cans, and tins. The annotations are provided in COCO format.
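Since the annotations are in COCO format, they can be browsed with the standard pycocotools API; the sketch below uses a placeholder file name rather than the dataset's actual annotation file.

```python
# Minimal sketch: load COCO-format annotations (as used by CLCXray) with pycocotools.
# "annotations.json" is a placeholder path, not the dataset's actual file name.
from pycocotools.coco import COCO

coco = COCO("annotations.json")
cat_ids = coco.getCatIds()                       # e.g. the 12 cutter/container categories
img_ids = coco.getImgIds()
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[:1]))
for ann in anns:
    x, y, w, h = ann["bbox"]                     # COCO boxes are [x, y, width, height]
    print(coco.loadCats(ann["category_id"])[0]["name"], x, y, w, h)
```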
1 PAPER • 1 BENCHMARK
COCO-OOC goes beyond standard object detection to ask the question: which objects are out-of-context (OOC)? Given an image with a set of objects, the goal of COCO-OOC is to determine whether an object is inconsistent with its contextual relations, and the OOC object must be detected with a bounding box.
Existing image/video datasets for cattle behavior recognition are mostly small, lack well-defined labels, or are collected in unrealistic controlled environments. This limits the utility of machine learning (ML) models learned from them. Therefore, we introduce a new dataset, called Cattle Visual Behaviors (CVB), that consists of 502 video clips, each fifteen seconds long, captured in natural lighting conditions and annotated with eleven visually perceptible behaviors of grazing cattle. By creating and sharing CVB, our aim is to develop improved models capable of recognizing all important cattle behaviors accurately and to assist other researchers and practitioners in developing and evaluating new ML models for cattle behavior classification using video data. The dataset is organized into three sub-directories, including:
1. raw_frames: contains 450 frames in each sub-folder, representing a 15-second video taken at a frame rate of 30 FPS.
2. annotations: contains the JSON annotation files.
In this dataset, an upper-torso humanoid robot with a 7-DOF arm explored 100 different objects belonging to 20 different categories using 10 behaviors: Look, Crush, Grasp, Hold, Lift, Drop, Poke, Push, Shake, and Tap.
The Chess Recognition Dataset 2K (ChessReD2K) comprises a diverse collection of images of chess formations captured using smartphone cameras; a sensor choice made to ensure real-world applicability. The dataset is accompanied by detailed annotations providing information about the chess pieces formation in the images, bounding-boxes, and chessboard corner annotations. The number of annotations for each image depends on the number of chess pieces depicted in it. There are 12 category ids in total (i.e., 6 piece types per colour) and the chessboard coordinates are in the form of algebraic notation strings (e.g., "a8"). The corners are annotated based on their location on the chessboard (e.g., "bottom-left") with respect to the white player's view. This discrimination between these different types of corners provides information about the orientation of the chessboard that can be leveraged to determine the image's perspective and viewing angle.
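To make the algebraic-notation annotation concrete, the helper below maps a square string such as "a8" to zero-based board indices; the row/column orientation chosen here is an assumption for illustration, not taken from the ChessReD2K tooling.

```python
# Minimal sketch: map an algebraic square like "a8" to zero-based (row, col) indices,
# with row 0 at White's back rank. The orientation convention is an assumption for
# illustration, not part of the ChessReD2K tooling.
def square_to_indices(square: str) -> tuple[int, int]:
    file_char, rank_char = square[0].lower(), square[1]
    col = ord(file_char) - ord("a")   # 'a' -> 0, ..., 'h' -> 7
    row = int(rank_char) - 1          # rank '1' -> 0, ..., rank '8' -> 7
    return row, col

print(square_to_indices("a8"))  # (7, 0)
```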
A dataset of 100K synthetic images of skin lesions, ground-truth (GT) segmentations of lesions and healthy skin, GT segmentations of seven body parts (head, torso, hips, legs, feet, arms and hands), and GT binary masks of non-skin regions in the texture maps of 215 scans from the 3DBodyTex.v1 dataset [2], [3] created using the framework described in [1]. The dataset is primarily intended to enable the development of skin lesion analysis methods. Synthetic image creation consisted of two main steps. First, skin lesions from the Fitzpatrick 17k dataset were blended onto skin regions of high-resolution three-dimensional human scans from the 3DBodyTex dataset [2], [3]. Second, two-dimensional renders of the modified scans were generated.
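The first step of that pipeline, blending a lesion crop onto a skin texture, can be illustrated with a generic alpha blend; this sketch is not the framework from [1], and the array shapes and mask source are assumptions.

```python
# Generic alpha-blending sketch for pasting a lesion crop onto a skin texture patch,
# illustrating the first step of the pipeline described above. This is not the actual
# blending framework from [1]; array shapes and the mask source are assumptions.
import numpy as np

def blend_lesion(texture: np.ndarray, lesion: np.ndarray, alpha: np.ndarray,
                 top: int, left: int) -> np.ndarray:
    """texture: HxWx3 float image in [0,1]; lesion: hxwx3; alpha: hxw blending mask."""
    out = texture.copy()
    h, w = lesion.shape[:2]
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = (alpha[..., None] * lesion
                                       + (1 - alpha[..., None]) * region)
    return out
```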
Paper: GridTracer: Automatic Mapping of Power Grids using Deep Learning and Overhead Imagery
This is the Infrared Elephant Images Dataset (named 'EleThermal dataset') collected from here and annotated by our project, released under GPLv3. Therefore, if you use the annotated 'EleThermal' dataset for any research or other product by any means, please acknowledge the following two works by citing them.
The FathomNet2023 competition dataset is a subset of the broader FathomNet marine image repository. The training and test images for the competition were all collected in the Monterey Bay Area between the surface and 1300 meters depth by the Monterey Bay Aquarium Research Institute. The images contain bounding box annotations of 290 categories of bottom-dwelling animals. The training and validation data are split across an 800-meter depth threshold: all training data is collected from 0-800 meters, while evaluation data comes from the whole 0-1300 meter range. Since an organism's habitat range is partially a function of depth, the species distributions in the two regions are overlapping but not identical. Test images are drawn from the same region but may come from above or below the depth horizon. The competition goal is to label the animals present in a given image (i.e., multi-label classification) and determine whether the image is out-of-sample.
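A minimal sketch of that 800-meter split is shown below using pandas; the metadata column names are hypothetical and not the competition's actual schema.

```python
# Minimal sketch of the 800 m depth split described above, using pandas. The
# DataFrame columns ("image_id", "depth_m") are hypothetical, not the competition's
# actual metadata schema.
import pandas as pd

meta = pd.DataFrame({
    "image_id": ["img_001", "img_002", "img_003"],
    "depth_m": [120.0, 750.0, 1100.0],
})

train = meta[meta["depth_m"] <= 800.0]   # training data: 0-800 m
evaluation = meta                        # evaluation data: full 0-1300 m range
print(len(train), len(evaluation))
```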
The HRPlanesv2 dataset contains 2,120 VHR Google Earth images. To further improve experimental results, images of airports from many different regions with various uses (civil/military/joint) were selected and labelled. A total of 14,335 aircraft have been labelled. Each image is stored as a 4800×2703-pixel ".jpg" file and each label is stored in YOLO ".txt" format. The dataset has been split into three parts: 70% train, 20% validation, and the remaining 10% test. The aircraft in the train and validation images have a percentage of 80 or more in size. Link: https://github.com/dilsadunsal/HRPlanesv2-Data-Set
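Since labels are stored in YOLO txt format (class id plus normalized center/size), the helper below converts one label line to absolute pixel corners for a 4800×2703 image; the file name is a placeholder.

```python
# Minimal sketch: read one YOLO-format label line ("class cx cy w h", all normalized)
# and convert it to absolute pixel corner coordinates for a 4800x2703 HRPlanesv2 image.
# The label path below is a placeholder.
def yolo_line_to_xyxy(line: str, img_w: int = 4800, img_h: int = 2703):
    cls, cx, cy, w, h = line.split()
    cx, cy = float(cx) * img_w, float(cy) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    return int(cls), (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

with open("example_label.txt") as f:          # placeholder file name
    for line in f:
        print(yolo_line_to_xyxy(line))
```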
The images come from the CCD camera of a highway measurement vehicle. Cracks and sealed cracks have been labeled. The labels differ from traditional block annotations, using redundant, dense annotation boxes instead. Some of the data is manually annotated, while the rest consists of model-generated annotations that have undergone careful manual inspection.
InfraParis is a novel and versatile dataset supporting multiple tasks across three modalities: RGB, depth, and infrared. From the city to the suburbs, it contains a variety of styles across the greater Paris region, providing rich semantic information. InfraParis contains 7,301 images with bounding boxes and full semantic (19-class) annotations. We assess various state-of-the-art baseline techniques, encompassing models for the tasks of semantic segmentation, object detection, and depth estimation.
It contains grayscale mono and stereo images (NavCam and LocCam) from laboratory tests performed by a prototype rover on a Martian-like testbed. The dataset can be used for artificial sample-tube detection and pose estimation. It also contains synthetic color images of the sample tube in a Martian scenario created with Unreal Engine.
The dataset, comprising 1204 meticulously curated images, serves as a comprehensive resource for advancing real-time mosquito detection models. The dataset is strategically divided into training, validation, and test sets, accounting for 87%, 8%, and 5% of the images, respectively. A rigorous preprocessing phase involves auto-orientation and resizing to standardize dimensions at 640x640 pixels. To ensure dataset integrity, the filter null criterion mandates that all images must contain annotations. Augmentations, including flips, rotations, crops, and grayscale applications, enhance the dataset's diversity, fostering robust model training. With a focus on quality and variety, this dataset provides a solid foundation for evaluating and enhancing real-time mosquito detection models.
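The listed augmentations can be approximated with a torchvision transform pipeline as sketched below; the parameter values are assumptions, and for detection training the bounding boxes would need to be transformed together with the image, which this image-only sketch omits.

```python
# Illustrative torchvision pipeline mirroring the preprocessing/augmentations listed
# above (resize to 640x640, flips, rotations, crops, grayscale). The parameter values
# are assumptions; for detection training the bounding boxes must be transformed
# consistently with the image, which this image-only sketch does not do.
from torchvision import transforms

augment = transforms.Compose([
    transforms.Resize((640, 640)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomCrop(640, padding=32),
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
])
# tensor = augment(pil_image)  # pil_image: a PIL.Image loaded from the dataset
```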
The data consists of 21 images of microtubules in PFA-fixed NIH 3T3 mouse embryonic fibroblasts (DSMZ: ACC59) labeled with a mouse anti-alpha-tubulin monoclonal IgG1 antibody (Thermofisher A11126, primary antibody) and visualized by a blue-fluorescent Alexa Fluor 405 goat anti-mouse IgG antibody (Thermofisher A-31553, secondary antibody). Acquisition of the images was performed using a confocal microscope (Olympus IX81).
Automating the creation of catalogues for radio galaxies in next-generation deep surveys necessitates the identification of components within extended sources and their respective infrared hosts. We present RadioGalaxyNET, a multimodal dataset, tailored for machine learning tasks to streamline the automated detection and localization of multi-component extended radio galaxies and their associated infrared hosts. The dataset encompasses 4,155 instances of galaxies across 2,800 images, incorporating both radio and infrared channels. Each instance furnishes details about the extended radio galaxy class, a bounding box covering all components, a pixel-level segmentation mask, and the keypoint position of the corresponding infrared host galaxy. RadioGalaxyNET is the first dataset to include images from the highly sensitive Australian Square Kilometre Array Pathfinder (ASKAP) radio telescope, corresponding infrared images, and instance-level annotations for galaxy detection.
Our dataset augments the TAO dataset with amodal bounding box annotations for fully invisible, out-of-frame, and occluded objects. Note that this implies TAO-Amodal also includes the original modal segmentation masks. Our dataset encompasses 880 categories, aimed at assessing the occlusion reasoning capabilities of current trackers through the paradigm of Tracking Any Object with Amodal perception (TAO-Amodal).
TRR360D is based on the ICDAR2019MTD modern table detection dataset and follows the annotation format of the DOTA dataset. The training set contains 600 rotated images with 977 annotated instances, and the test set contains 240 rotated images with 499 annotated instances.
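DOTA-style annotations store each rotated box as four corner points followed by a category and a difficulty flag; the parser below assumes that standard layout, and TRR360D's exact variant may differ.

```python
# Minimal sketch: parse a DOTA-style annotation line, i.e.
# "x1 y1 x2 y2 x3 y3 x4 y4 category difficulty" (four corners of a rotated box).
# TRR360D's exact variant may differ; this follows the standard DOTA layout.
def parse_dota_line(line: str):
    parts = line.strip().split()
    corners = [(float(parts[i]), float(parts[i + 1])) for i in range(0, 8, 2)]
    category, difficulty = parts[8], int(parts[9])
    return corners, category, difficulty

corners, cat, diff = parse_dota_line("10 10 200 12 198 90 8 88 table 0")
print(cat, corners)
```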