KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to suit their needs. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 (140 for training and 112 for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images (from the visual odometry challenge).
3,219 PAPERS • 141 BENCHMARKS
LaSOT is a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames in total. Each frame in these sequences is carefully and manually annotated with a bounding box, making LaSOT one of the largest densely annotated tracking benchmarks. The average video length of LaSOT is more than 2,500 frames, and each sequence comprises various challenges deriving from the wild, where target objects may disappear and reappear in the view.
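As a quick illustration of consuming such dense box annotations, the sketch below parses the per-frame boxes for one sequence. It assumes the common LaSOT layout of a groundtruth.txt file with one comma-separated "x,y,w,h" line per frame; the path and layout are assumptions about a local download, not an official API.

```python
from pathlib import Path

def load_lasot_boxes(seq_dir: str):
    """Parse per-frame axis-aligned boxes for one sequence.

    Assumes a LaSOT-style layout: <seq_dir>/groundtruth.txt with one
    comma-separated "x,y,w,h" line per annotated frame.
    """
    boxes = []
    with open(Path(seq_dir) / "groundtruth.txt") as f:
        for line in f:
            x, y, w, h = map(float, line.strip().split(","))
            boxes.append((x, y, w, h))
    return boxes
```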
233 PAPERS • 3 BENCHMARKS
The GOT-10k dataset contains more than 10,000 video segments of real-world moving objects and over 1.5 million manually labelled bounding boxes. The dataset contains more than 560 classes of real-world moving objects and 80+ classes of motion patterns.
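A minimal sketch of iterating over the dataset, assuming the official got10k Python toolkit (pip install got10k) and a local download under data/GOT-10k; the exact API surface may differ between toolkit versions.

```python
from got10k.datasets import GOT10k

# Each item is one video segment: a list of frame paths plus an
# (N, 4) array of [x, y, w, h] boxes, one row per frame.
dataset = GOT10k(root_dir="data/GOT-10k", subset="val")
img_files, annos = dataset[0]
print(len(img_files), annos.shape)
```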
203 PAPERS • 2 BENCHMARKS
The MOTChallenge datasets are designed for the task of multiple object tracking. There are several variants of the dataset released each year, such as MOT15, MOT17, MOT20.
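For orientation, the sketch below groups MOTChallenge-style ground truth by frame, assuming the published gt.txt layout of frame, id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility.

```python
import csv
from collections import defaultdict

def load_mot_gt(gt_path: str):
    """Group a MOTChallenge gt.txt file by frame number.

    Assumes the comma-separated layout:
    frame, id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility
    """
    per_frame = defaultdict(list)
    with open(gt_path) as f:
        for row in csv.reader(f):
            frame, track_id = int(row[0]), int(row[1])
            x, y, w, h = map(float, row[2:6])
            per_frame[frame].append((track_id, x, y, w, h))
    return per_frame
```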
175 PAPERS • 8 BENCHMARKS
VOT2018 is a dataset for visual object tracking. It consists of 60 challenging videos collected from real-life datasets.
124 PAPERS • 1 BENCHMARK
Virtual KITTI is a photo-realistic synthetic video dataset designed to learn and evaluate computer vision models for several video understanding tasks: object detection and multi-object tracking, scene-level and instance-level semantic segmentation, optical flow, and depth estimation.
120 PAPERS • 1 BENCHMARK
UAVDT is a large-scale, challenging UAV Detection and Tracking benchmark (about 80,000 representative frames selected from 10 hours of raw video) covering three important fundamental tasks: object DETection (DET), Single Object Tracking (SOT), and Multiple Object Tracking (MOT).
81 PAPERS • 1 BENCHMARK
VOT2017 is a Visual Object Tracking dataset that contains 60 short sequences annotated with 6 different attributes.
54 PAPERS • 2 BENCHMARKS
This dataset consists of 100 challenging video sequences captured from real-world traffic scenes (over 140,000 frames with rich annotations, including occlusion, weather, vehicle category, truncation, and vehicle bounding boxes) for object detection, object tracking, and multiple object tracking (MOT) systems.
50 PAPERS • 1 BENCHMARK
The Event-Camera Dataset is a collection of datasets with an event-based camera for high-speed robotics. The data also include intensity images, inertial measurements, and ground truth from a motion-capture system. An event-based camera is a revolutionary vision sensor with three key advantages: a measurement rate that is almost 1 million times faster than standard cameras, a latency of 1 microsecond, and a high dynamic range of 130 decibels (standard cameras only have 60 dB). These properties enable the design of a new class of algorithms for high-speed robotics, where standard cameras suffer from motion blur and high latency. All the data are released both as text files and binary (i.e., rosbag) files.
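The text release makes the event stream easy to inspect. Below is a minimal sketch, assuming the events.txt files use one "timestamp x y polarity" line per event and the 240x180 DAVIS resolution; the file name is an assumption about a local download.

```python
import numpy as np

events = np.loadtxt("shapes_rotation/events.txt")   # columns: t, x, y, polarity
t, x, y, p = events.T

# Accumulate signed polarities over the first 10 ms into an event image,
# the kind of representation many event-based algorithms start from.
H, W = 180, 240
mask = t < t[0] + 0.01
img = np.zeros((H, W))
np.add.at(img, (y[mask].astype(int), x[mask].astype(int)),
          np.where(p[mask] > 0, 1.0, -1.0))
```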
44 PAPERS • 2 BENCHMARKS
The Objectron dataset is a collection of short, object-centric video clips, which are accompanied by AR session metadata that includes camera poses, sparse point-clouds and characterization of the planar surfaces in the surrounding environment. In each video, the camera moves around the object, capturing it from different angles. The data also contain manually annotated 3D bounding boxes for each object, which describe the object’s position, orientation, and dimensions. The dataset consists of 15K annotated video clips supplemented with over 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes. To ensure geo-diversity, the dataset is collected from 10 countries across five continents.
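To make the 3D box annotation concrete, the sketch below recovers the 8 corners of an oriented box from a center, rotation, and size, matching the generic position/orientation/dimensions parameterization described above; the function and argument names are illustrative, not Objectron's own API.

```python
import itertools
import numpy as np

def box_corners(center, rotation, size):
    """Return the (8, 3) corners of an oriented 3D bounding box.

    center: (3,) translation, rotation: (3, 3) matrix, size: (3,)
    edge lengths along the box axes.
    """
    half = np.asarray(size) / 2.0
    # All sign combinations of the half-extents, in the box frame.
    corners = np.array(list(itertools.product(*zip(-half, half))))
    return corners @ np.asarray(rotation).T + np.asarray(center)
```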
38 PAPERS • NO BENCHMARKS YET
TAO is a federated dataset for Tracking Any Object, containing 2,907 high resolution videos, captured in diverse environments, which are half a minute long on average. A bottom-up approach was used for discovering a large vocabulary of 833 categories, an order of magnitude more than prior tracking benchmarks.
38 PAPERS • 1 BENCHMARK
OxUvA is a dataset and benchmark for evaluating long-term single-object tracking algorithms.
34 PAPERS • NO BENCHMARKS YET
Virtual KITTI 2 is an updated version of the well-known Virtual KITTI dataset which consists of 5 sequence clones from the KITTI tracking benchmark. In addition, the dataset provides different variants of these sequences such as modified weather conditions (e.g. fog, rain) or modified camera configurations (e.g. rotated by 15°). For each sequence we provide multiple sets of images containing RGB, depth, class segmentation, instance segmentation, flow, and scene flow data. Camera parameters and poses as well as vehicle locations are available as well. In order to showcase some of the dataset's capabilities, we ran multiple relevant experiments using state-of-the-art algorithms from the field of autonomous driving. The dataset is available for download at https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds.
33 PAPERS • 1 BENCHMARK
The Visual Object Tracking (VOT) dataset is a collection of video sequences used for evaluating and benchmarking visual object tracking algorithms. It provides a standardized platform for researchers and practitioners to assess the performance of different tracking methods.
30 PAPERS • 7 BENCHMARKS
The Multi-Object Tracking and Segmentation (MOTS) benchmark [2] consists of 21 training sequences and 29 test sequences. It is based on the KITTI Tracking Evaluation 2012 and extends the annotations to the MOTS task by adding dense pixel-wise segmentation labels for every object. Submitted results are evaluated using the HOTA, CLEAR MOT, and MT/PT/ML metrics (adapted for the segmentation case), and methods are ranked by HOTA [1]. The development kit and GitHub evaluation code provide details about the data format as well as utility functions for reading and writing the label files; evaluation is performed using the code from the TrackEval repository.
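As a hedged sketch of reading these labels, the function below decodes one line of the MOTS txt annotation format into a binary mask with pycocotools, assuming the published layout of frame, object id, class id, image height, image width, and a COCO-compressed run-length string.

```python
from pycocotools import mask as rletools

def parse_mots_line(line: str):
    """Decode one MOTS txt line into ids plus a binary (H, W) mask.

    Assumed layout: frame object_id class_id img_height img_width rle
    """
    frame, obj_id, cls_id, h, w, rle = line.strip().split(" ", 5)
    mask = rletools.decode({"size": [int(h), int(w)],
                            "counts": rle.encode("utf-8")})
    return int(frame), int(obj_id), int(cls_id), mask
```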
26 PAPERS • 1 BENCHMARK
A new video dataset for aerial-view concurrent human action detection. It consists of 43 fully annotated, minute-long sequences with 12 action classes. Okutama-Action features many challenges missing in current datasets, including dynamic transitions of actions, significant changes in scale and aspect ratio, abrupt camera movement, and multi-labeled actors.
19 PAPERS • 1 BENCHMARK
A new large-scale dataset for understanding human motions, poses, and actions in a variety of realistic events, especially crowded and complex events. It contains a record number of poses (>1M), the largest number of action labels for complex events (>56k), and one of the largest sets of long-term trajectories (average trajectory length >480). In addition, an online evaluation server is provided for researchers to evaluate their approaches.
18 PAPERS • 1 BENCHMARK
CDTB is a color-and-depth visual object tracking dataset recorded by several passive and active RGB-D setups, containing indoor as well as outdoor sequences acquired in direct sunlight. The sequences were recorded to contain significant object pose change, clutter, occlusion, and periods of long-term target absence, to enable tracker evaluation under realistic conditions. Sequences are per-frame annotated with 13 visual attributes for detailed analysis. It contains around 100,000 samples. Source: https://www.vicos.si/Projects/CDTB
16 PAPERS • NO BENCHMARKS YET
SeaDronesSee is a large-scale data set aimed at helping develop systems for Search and Rescue (SAR) using Unmanned Aerial Vehicles (UAVs) in maritime scenarios. Building highly complex autonomous UAV systems that aid in SAR missions requires robust computer vision algorithms to detect and track objects or persons of interest. This data set provides three sets of tracks: object detection, single-object tracking and multi-object tracking. Each track consists of its own data set and leaderboard.
16 PAPERS • 3 BENCHMARKS
A new long video dataset and benchmark for single object tracking. The dataset consists of 50 HD videos from real-world scenarios, spanning over 400 minutes (676K frames) in total, making it more than 20 times larger in average duration per sequence and more than 8 times larger in total covered duration than existing generic datasets for visual tracking.
14 PAPERS • NO BENCHMARKS YET
VisEvent (Visible-Event benchmark) is a dataset constructed for the evaluation of tracking that combines visible and event cameras.
13 PAPERS • 1 BENCHMARK
The dataset comprises 25 short sequences showing various objects in challenging backgrounds. Eight sequences are from the VOT2013 challenge (bolt, bicycle, david, diving, gymnastics, hand, sunshade, woman). The new sequences show complementary objects and backgrounds, for example a fish underwater or a surfer riding a big wave. The sequences were chosen from a large pool using a methodology based on clustering visual features of object and background, so that the 25 sequences evenly sample the existing pool.
12 PAPERS • 1 BENCHMARK
This dataset includes 4,500 fully annotated images (over 30,000 license plate characters) from 150 vehicles in real-world scenarios where both the vehicle and the camera (inside another vehicle) are moving.
11 PAPERS • 1 BENCHMARK
In this work, we propose a general dataset for Color-Event camera based Single Object Tracking, termed COESOT. It contains 1,354 color-event videos with 478,721 RGB frames, split into a training subset of 827 videos and a testing subset of 527 videos. The videos are collected from both outdoor and indoor scenarios (such as the street, zoo, and home) using the DAVIS346 event camera with a zoom lens. Therefore, our videos can reflect variation in distance and depth, which other datasets fail to capture. Unlike existing benchmarks that contain limited categories, COESOT covers a wider range of object categories (90 classes), as shown in Fig. 3 (a). These mainly fall into four groups: persons, animals, electronics, and other goods.
9 PAPERS • 1 BENCHMARK
Large-scale single-object tracking dataset, containing 108 sequences with a total length of 1.5 hours. FE108 provides ground truth annotations in both the frame and event domains. The annotation frequency is up to 40 Hz for the frame domain and 240 Hz for the event domain. FE108 is the largest event-frame-based dataset for single object tracking, and also offers the highest annotation frequency in the event domain.
8 PAPERS • 1 BENCHMARK
PathTrack is a dataset for person tracking which contains more than 15,000 person trajectories in 720 sequences.
8 PAPERS • NO BENCHMARKS YET
TREK-150 is a benchmark dataset for object tracking in First Person Vision (FPV) videos composed of 150 densely annotated video sequences.
7 PAPERS • NO BENCHMARKS YET
Multi-camera Multiple People Tracking (MMPTRACK) dataset has about 9.6 hours of videos, with over half a million frame-wise annotations. The dataset is densely annotated, e.g., per-frame bounding boxes and person identities are available, as well as camera calibration parameters. Our dataset is recorded at 15 frames per second (FPS) in five diverse and challenging environment settings: retail, lobby, industry, cafe, and office. This is by far the largest publicly available multi-camera multiple people tracking dataset.
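Since the calibration parameters are what tie the camera views together, here is a minimal pinhole-projection sketch for mapping a world point into one camera; K, R, and t stand for generic intrinsics and extrinsics, and MMPTRACK's exact calibration file format is not assumed.

```python
import numpy as np

def project(point_3d, K, R, t):
    """Project a 3D world point into pixel coordinates for one camera.

    K: (3, 3) intrinsics, R: (3, 3) rotation, t: (3,) translation
    (world-to-camera). Generic pinhole model, not a dataset-specific API.
    """
    p_cam = R @ np.asarray(point_3d) + np.asarray(t)
    u, v, z = K @ p_cam
    return u / z, v / z
```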
5 PAPERS • 1 BENCHMARK
PTB-TIR is a Thermal InfraRed (TIR) pedestrian tracking benchmark that provides 60 TIR sequences with manual annotations. The benchmark is used to fairly evaluate TIR trackers.
5 PAPERS • NO BENCHMARKS YET
VideoCube is a high-quality and large-scale benchmark to create a challenging real-world experimental environment for Global Instance Tracking (GIT). MGIT is a high-quality and multi-modal benchmark based on VideoCube-Tiny to fully represent the complex spatio-temporal and causal relationships coupled in longer narrative content.
The dataset consists of 92 groups of video clips with 113,918 high-resolution frames taken by two drones and 63 groups of video clips with 145,875 high-resolution frames taken by three drones.
4 PAPERS • NO BENCHMARKS YET
Perception Test is a benchmark designed to evaluate the perception and reasoning skills of multimodal models. It introduces real-world videos designed to show perceptually interesting situations and defines multiple tasks that require understanding of memory, abstract patterns, physics, and semantics – across visual, audio, and text modalities. The benchmark consists of 11.6k videos, 23s average length, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels: object and point tracks, temporal action and sound segments, multiple-choice video question-answers, and grounded video question-answers. The benchmark probes pre-trained models for their transfer capabilities in a zero-shot, few-shot, or fine-tuning regime.
The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set of datasets, e.g. Microsoft COCO and Pascal VOC. Due to image retrieval and annotation costs, these datasets consist largely of images found on the web and do not represent many real-life domains that are being modelled in practice, e.g. satellite, microscopic and gaming, making it difficult to assess the degree of generalization learned by the model.
4 PAPERS • 1 BENCHMARK
VOT2019 is a Visual Object Tracking benchmark for short-term tracking in RGB.
Vehicle-to-Everything (V2X) networks have enabled collaborative perception in autonomous driving, a promising solution to the fundamental defects of stand-alone intelligence, including blind zones and limited long-range perception. However, the lack of datasets has severely blocked the development of collaborative perception algorithms. In this work, we release DOLPHINS: Dataset for cOllaborative Perception enabled Harmonious and INterconnected Self-driving, a new simulated large-scale, various-scenario, multi-view, multi-modality autonomous driving dataset that provides a ground-breaking benchmark platform for interconnected autonomous driving. DOLPHINS outperforms current datasets in six dimensions: temporally-aligned images and point clouds from both vehicles and Road Side Units (RSUs), enabling both Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) based collaborative perception; 6 typical scenarios with dynamic weather conditions, making it the most varied interconnected autonomous driving dataset.
3 PAPERS • NO BENCHMARKS YET
The Household Object Movements from Everyday Routines (HOMER) dataset is composed of routine behaviors for five households, spanning 50 days for the train split and 10 days for the test split. The households are based on an identical apartment setting with four rooms, 108 objects, and 33 atomic actions such as find, grab, etc.
Twenty DAVIS recordings with a total duration of about 1.25 hours were obtained by driving the two robots in the robot arena of the University of Ulster in Londonderry.
UAV-GESTURE is a dataset for UAV control and gesture recognition. It is an outdoor recorded video dataset for UAV commanding signals with 13 gestures suitable for basic UAV navigation and command from general aircraft handling and helicopter handling signals. It contains 119 high-definition video clips consisting of 37,151 frames.
A synthetic training dataset of 50,000 depth images and 320,000 object masks generated using simulated heaps of 3D CAD models.
3 PAPERS • 1 BENCHMARK
The AU-AIR is a multi-modal aerial dataset captured by a UAV. Combining visual data, object annotations, and flight data (time, GPS, altitude, IMU sensor data, velocities), AU-AIR bridges computer vision and robotics for UAVs.
2 PAPERS • NO BENCHMARKS YET
Caltech Fish Counting Dataset (CFC) is a large-scale dataset for detecting, tracking, and counting fish in sonar videos. This dataset contains over 1,500 videos sourced from seven different sonar cameras.
Digital-Twin Tracking Dataset (DTTD) is a novel RGB-D dataset intended to enable further research on digital-twin tracking and to extend potential solutions towards longer ranges and mm localization accuracy. In total, 103 scenes of 10 common off-the-shelf objects with rich textures are recorded, with each frame annotated with a per-pixel semantic segmentation and ground-truth object poses provided by a commercial motion-capture system.
Informative Tracking Benchmark (ITB) is a small but informative tracking benchmark comprising 7% of the 1.2M frames of existing and newly collected datasets, which enables efficient evaluation while ensuring effectiveness. Specifically, the authors designed a quality assessment mechanism to select the most informative sequences from existing benchmarks, taking into account 1) challenging level, 2) discriminative strength, and 3) density of appearance variations. Furthermore, they collected additional sequences to ensure the diversity and balance of tracking scenarios, leading to a total of 20 sequences for each scenario.
2 PAPERS • 1 BENCHMARK
The Omni-MOT is a realistic, CARLA-based large-scale dataset for multiple vehicle tracking. It comprises over 14M frames, 250K tracks, and 110 million bounding boxes, with three weather conditions, three crowd levels, and three camera views across five simulated towns.
A classification dataset of radar spectrograms in a "ground surveillance" setting, recorded with the Open Radar Initiative. The dataset has been collected with a stationary radar and targets moving in front of the radar, using both collaborative and non-collaborative targets.
The dataset is designed specifically to solve a range of computer vision problems (2D-3D tracking, posture) faced by biologists while designing behavior studies with animals.
1 PAPER • NO BENCHMARKS YET
BioDrone is the first bionic drone-based single object tracking benchmark. It features videos captured from a flapping-wing UAV system with major camera shake due to its aerodynamics. BioDrone highlights the tracking of tiny targets with drastic changes between consecutive frames, providing a new robust vision benchmark for SOT. Its key features are: 1) a large-scale, high-quality benchmark with robust vision challenges; 2) rich annotation of challenging factors; 3) videos from a bionic UAV; and 4) tracking baselines with comprehensive experimental analyses.