The Waymo Open Dataset comprises high-resolution sensor data collected by autonomous vehicles operated by the Waymo Driver in a wide variety of conditions.
373 PAPERS • 12 BENCHMARKS
The EPIC-KITCHENS-55 dataset comprises 432 egocentric videos recorded at 60 fps by 32 participants in their kitchens with a head-mounted camera. There is no guiding script: participants freely perform kitchen activities such as cooking, food preparation, and washing up. Each video is split into short action segments (mean duration 3.7 s) with specific start and end times and a verb and noun annotation describing the action (e.g. 'open fridge'). There are 125 verb classes and 331 noun classes. The dataset is divided into one train and two test splits. A minimal sketch of the segment annotation structure follows this entry.
35 PAPERS • 3 BENCHMARKS
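For illustration, here is a minimal sketch of how an EPIC-KITCHENS-style action segment might be represented in code; the field names and example values are assumptions, not the official annotation schema.

```python
from dataclasses import dataclass

@dataclass
class ActionSegment:
    """One short action segment within a longer egocentric video.

    Field names are illustrative, not the official EPIC-KITCHENS schema.
    """
    video_id: str
    start_sec: float   # segment start time within the video
    stop_sec: float    # segment end time within the video
    verb_class: int    # one of 125 verb classes, e.g. 'open'
    noun_class: int    # one of 331 noun classes, e.g. 'fridge'

    @property
    def duration(self) -> float:
        return self.stop_sec - self.start_sec

# Hypothetical example: a ~3.7 s 'open fridge' segment.
seg = ActionSegment("P01_01", 12.4, 16.1, verb_class=2, noun_class=10)
print(f"{seg.video_id}: {seg.duration:.1f}s, verb={seg.verb_class}, noun={seg.noun_class}")
```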
ImageNet VID is a large-scale public dataset for video object detection and contains more than 1M frames for training and more than 100k frames for validation.
24 PAPERS • 1 BENCHMARK
Prophesee's GEN1 Automotive Detection Dataset is the largest event-based dataset to date. A sketch of how asynchronous event data is typically accumulated into frame-like tensors follows this entry.
10 PAPERS • 1 BENCHMARK
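Event cameras such as the one used for GEN1 emit asynchronous per-pixel events rather than frames, so a common preprocessing step for detection is accumulating a chunk of events into a frame-like histogram. A minimal sketch, assuming an (N, 4) array of [x, y, t, polarity] rows; this layout is an assumption for illustration, not the GEN1 file format.

```python
import numpy as np

def events_to_histogram(events: np.ndarray, width: int, height: int) -> np.ndarray:
    """Accumulate events into a 2-channel (off/on polarity) histogram.

    `events` is an (N, 4) array of [x, y, t, polarity] rows -- an assumed
    layout for illustration, not the GEN1 file format.
    """
    hist = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = events[:, 3].astype(int)  # polarity: 0 (off) or 1 (on)
    np.add.at(hist, (p, y, x), 1.0)  # count events per pixel and polarity
    return hist

# Hypothetical chunk of 5 events on a tiny 8x4 sensor.
ev = np.array([[3, 1, 100, 1],
               [3, 1, 120, 1],
               [0, 0, 130, 0],
               [4, 2, 140, 1],
               [0, 0, 150, 0]], dtype=np.float64)
print(events_to_histogram(ev, width=8, height=4).sum(axis=(1, 2)))  # [2. 3.]
```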
YouTube-BoundingBoxes (YT-BB) is a large-scale dataset of video URLs with densely sampled object bounding-box annotations. The dataset consists of approximately 380,000 video segments, each about 19 s long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell-phone camera. The objects represent a subset of the MS COCO label set. All video segments were human-annotated with high-precision classification labels and bounding boxes at 1 frame per second; a sketch of mapping such annotations to frames follows this entry.
7 PAPERS • 1 BENCHMARK
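Because YT-BB boxes are annotated at 1 frame per second, a first step when working with the data is mapping a timestamped annotation onto a decoded frame and scaling relative box coordinates to pixels. A minimal sketch, assuming millisecond timestamps and [0, 1] relative coordinates; the exact CSV layout is not reproduced here.

```python
def timestamp_to_frame(timestamp_ms: int, fps: float) -> int:
    """Map an annotation timestamp (milliseconds) to the nearest decoded frame index."""
    return round(timestamp_ms / 1000.0 * fps)

def box_to_pixels(xmin: float, xmax: float, ymin: float, ymax: float,
                  width: int, height: int) -> tuple:
    """Convert relative [0, 1] box coordinates to pixel coordinates.

    Relative coordinates are an assumption for illustration.
    """
    return (xmin * width, ymin * height, xmax * width, ymax * height)

# Hypothetical: an annotation at t=5000 ms in a 30 fps clip -> frame 150.
print(timestamp_to_frame(5000, fps=30.0))                      # 150
print(box_to_pixels(0.25, 0.75, 0.1, 0.9, width=640, height=360))
```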
A dataset specially designed to evaluate active learning for video object detection in road scenes.
4 PAPERS • NO BENCHMARKS YET
OAK is a benchmark dataset for online continual object detection built on egocentric video. OAK adopts the KrishnaCam videos, an egocentric video stream collected over nine months by a graduate student, and provides exhaustive bounding-box annotations for 80 video snippets (~17.5 hours) covering 105 object categories in outdoor scenes. A sketch of an online continual evaluation loop follows this entry.
3 PAPERS • NO BENCHMARKS YET
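In the online continual setting that OAK targets, a model sees the stream in temporal order and may update itself only after making each prediction. A minimal sketch of such an evaluation loop; `detector` and `metric` are hypothetical interfaces for illustration, not OAK's actual API.

```python
def online_continual_eval(stream, detector, metric):
    """Evaluate a detector on a temporally ordered video stream, adapting online.

    `stream` yields (frame, gt_boxes) pairs in recording order; `detector`
    and `metric` are hypothetical interfaces used only for illustration.
    """
    for frame, gt_boxes in stream:
        preds = detector.predict(frame)   # predict BEFORE seeing the labels
        metric.update(preds, gt_boxes)    # score the prediction
        detector.adapt(frame, gt_boxes)   # only then update the model online
    return metric.result()
```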
The Temporal Hands, Guns and Phones (THGP) dataset is a collection of 5,960 video frames (5,000 for training and 960 for testing). The training part is composed of 50 videos of 100 frames each (720 × 720 pixels): 20 videos of shooting drills, 20 of armed robberies, and 10 of people making phone calls. The testing part contains 48 videos of 20 frames each (720 × 720), covering phone calls, gun reviews, shooting drills, and armed robberies at convenience stores. The dataset is labeled with bounding boxes for hands, phones, and guns.
1 PAPER • NO BENCHMARKS YET
USC-GRAD-STDdb comprises 115 video segments containing more than 25,000 annotated HD 720p frames (≈1280×720) with small objects of interest ranging from 16 (≈4×4) to 256 (≈16×16) pixels in area. Video length ranges from 150 to 500 frames. The size of each object is determined by its bounding box, so accurate annotation is of utmost importance for reliable performance metrics; naturally, the smaller the object, the harder the annotation. Annotation was carried out with the ViTBAT tool, fitting the boxes as tightly as possible to the objects of interest in each video frame. In total, more than 56,000 ground-truth labels have been generated. A small helper for the dataset's pixel-area definition of object size follows this entry.
1 PAPER • 1 BENCHMARK
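Since USC-GRAD-STDdb defines object size by bounding-box pixel area (16 to 256 px²), a small helper can check whether a box falls in the dataset's small-object range; the width/height box format here is an assumption for illustration.

```python
def bbox_area(w: int, h: int) -> int:
    """Pixel area of a bounding box given its width and height."""
    return w * h

def is_small_object(w: int, h: int, min_area: int = 16, max_area: int = 256) -> bool:
    """True if the box falls in the dataset's small-object range (≈4x4 to ≈16x16)."""
    return min_area <= bbox_area(w, h) <= max_area

print(is_small_object(4, 4))    # True: 16 px^2, the lower bound
print(is_small_object(16, 16))  # True: 256 px^2, the upper bound
print(is_small_object(20, 20))  # False: 400 px^2, too large
```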
VISEM-Tracking is a dataset consisting of 20 thirty-second video recordings of spermatozoa with manually annotated bounding-box coordinates and a set of sperm characteristics analyzed by domain experts. It extends the previously published VISEM dataset. In addition to the annotated data, unlabeled video clips are provided for easy access and analysis.