The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 video clips covering 600 human action classes with at least 600 video clips for each action class. Each video clip lasts around 10 seconds and is labeled with a single action class. The videos are collected from YouTube.
1,180 PAPERS • 28 BENCHMARKS
The Densely Annotated Video Segmentation (DAVIS) dataset is a high-quality, high-resolution, densely annotated video segmentation dataset available at two resolutions, 480p and 1080p. It contains 50 video sequences with 3,455 frames densely annotated at the pixel level: 30 videos with 2,079 frames for training and 20 videos with 1,376 frames for validation.
634 PAPERS • 13 BENCHMARKS
Object Tracking Benchmark (OTB) is a visual tracking benchmark that is widely used to evaluate the performance of visual tracking algorithms. The dataset contains a total of 100 sequences, each annotated frame-by-frame with bounding boxes and 11 challenge attributes. The OTB-2013 dataset contains 51 of these sequences, while the OTB-2015 dataset contains all 100 sequences.
394 PAPERS • 4 BENCHMARKS
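OTB-style benchmarks are typically scored with success (bounding-box overlap) and precision (center location error) curves computed from the per-frame annotations. The sketch below illustrates those two per-frame quantities in plain Python; the function and variable names are my own, and this is not the benchmark's official toolkit.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def center_error(box_a, box_b):
    """Euclidean distance between box centers, in pixels."""
    ca = (box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0)
    cb = (box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0)
    return float(np.hypot(ca[0] - cb[0], ca[1] - cb[1]))

def success_rate(pred_boxes, gt_boxes, overlap_threshold=0.5):
    """Fraction of frames whose IoU exceeds the threshold (one point on the success curve)."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean([o > overlap_threshold for o in overlaps]))

def precision(pred_boxes, gt_boxes, pixel_threshold=20.0):
    """Fraction of frames whose center error is below the threshold (one point on the precision curve)."""
    errors = [center_error(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean([e < pixel_threshold for e in errors]))
```

Sweeping the overlap threshold from 0 to 1 yields the full success plot, whose area under the curve is the usual ranking score; precision is conventionally reported at a 20-pixel center-error threshold.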
LaSOT is a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames in total. Each frame is carefully and manually annotated with a bounding box, making LaSOT one of the largest densely annotated tracking benchmarks. The average sequence length is more than 2,500 frames, and each sequence contains challenges typical of the wild, where target objects may disappear from view and reappear.
233 PAPERS • 3 BENCHMARKS
TrackingNet is a large-scale tracking dataset consisting of videos in the wild. It has a total of 30,643 videos, split into 30,132 training videos and 511 testing videos, with an average of 470.9 frames per sequence.
181 PAPERS • 2 BENCHMARKS
OTB-2015, also referred to as the Visual Tracker Benchmark, is a visual tracking dataset. It contains 100 commonly used video sequences for evaluating visual tracking. Image Source: http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html
174 PAPERS • 1 BENCHMARK
OTB2013 is the previous version of the current OTB2015 Visual Tracker Benchmark. It contains only 50 tracking sequences, as opposed to the 100 sequences in the current version of the benchmark.
110 PAPERS • 2 BENCHMARKS
Tracking by Natural Language (TNL2K) is a benchmark constructed for the evaluation of tracking by natural language specification.
43 PAPERS • 2 BENCHMARKS
Kubric is a data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.
42 PAPERS • 1 BENCHMARK
OxUvA is a dataset and benchmark for evaluating single-object tracking algorithms.
34 PAPERS • NO BENCHMARKS YET
The Visual Object Tracking (VOT) dataset is a collection of video sequences used for evaluating and benchmarking visual object tracking algorithms. It provides a standardized platform for researchers and practitioners to assess the performance of different tracking methods.
30 PAPERS • 7 BENCHMARKS
TAP-Vid is a benchmark which contains both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. This is designed for a new task called tracking any point.
22 PAPERS • 1 BENCHMARK
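Point-tracking benchmarks such as TAP-Vid score a tracker by how many predicted points fall within small pixel thresholds of the ground-truth tracks on frames where the point is visible. The snippet below is a minimal sketch of that per-threshold position accuracy; the default threshold values and array layout are assumptions for illustration, not the benchmark's official evaluation code.

```python
import numpy as np

def position_accuracy(pred_pts, gt_pts, visible, thresholds=(1, 2, 4, 8, 16)):
    """Average fraction of visible points predicted within each pixel threshold.

    pred_pts, gt_pts: arrays of shape (num_points, num_frames, 2), in pixels.
    visible:          boolean array of shape (num_points, num_frames).
    """
    dists = np.linalg.norm(pred_pts - gt_pts, axis=-1)   # (num_points, num_frames)
    accs = []
    for t in thresholds:
        correct = (dists < t) & visible
        accs.append(correct.sum() / max(visible.sum(), 1))
    return float(np.mean(accs))
```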
CDTB (color-and-depth visual object tracking) is a dataset recorded by several passive and active RGB-D setups and contains indoor as well as outdoor sequences acquired in direct sunlight. The sequences were recorded to contain significant object pose change, clutter, occlusion, and periods of long-term target absence, enabling tracker evaluation under realistic conditions. Sequences are annotated per frame with 13 visual attributes for detailed analysis. The dataset contains around 100,000 samples. Image Source: https://www.vicos.si/Projects/CDTB
16 PAPERS • NO BENCHMARKS YET
A new long video dataset and benchmark for single object tracking. The dataset consists of 50 HD videos from real-world scenarios, spanning more than 400 minutes (676K frames), making it more than 20 times larger in average duration per sequence and more than 8 times larger in total covered duration than existing generic datasets for visual tracking.
14 PAPERS • NO BENCHMARKS YET
The dataset comprises 25 short sequences showing various objects in challenging backgrounds. Eight sequences are from the VOT2013 challenge (bolt, bicycle, david, diving, gymnastics, hand, sunshade, woman). The new sequences show complementary objects and backgrounds, for example a fish underwater or a surfer riding a big wave. The sequences were chosen from a large pool using a methodology based on clustering visual features of object and background, so that the 25 selected sequences evenly sample the existing pool.
12 PAPERS • 1 BENCHMARK
RGB-Stacking is a benchmark for vision-based robotic manipulation. The robot is trained to learn how to grasp objects and balance them on top of one another.
11 PAPERS • 3 BENCHMARKS
YouTube-BoundingBoxes (YT-BB) is a large-scale data set of video URLs with densely-sampled object bounding box annotations. The data set consists of approximately 380,000 video segments about 19s long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera. The objects represent a subset of the MS COCO label set. All video segments were human-annotated with high-precision classification labels and bounding boxes at 1 frame per second.
7 PAPERS • 1 BENCHMARK
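Because YT-BB is distributed as video URLs plus 1-fps box annotations rather than video files, a consumer typically materialises each labelled frame as a small record keyed by video and timestamp. The dataclass below only illustrates that structure; the field names are hypothetical and not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class YTBBAnnotation:
    """Illustrative record for one annotated frame (hypothetical fields, not the official schema)."""
    video_id: str        # YouTube video identifier
    timestamp_ms: int    # position of the annotated frame, sampled at 1 fps
    class_label: str     # one of the MS COCO subset categories
    object_id: int       # distinguishes multiple instances in the same segment
    present: bool        # whether the object is visible in this frame
    xmin: float          # box coordinates
    xmax: float
    ymin: float
    ymax: float
```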
The Multi-camera Multiple People Tracking (MMPTRACK) dataset has about 9.6 hours of video, with over half a million frame-wise annotations. The dataset is densely annotated: per-frame bounding boxes and person identities are available, as well as camera calibration parameters. The dataset is recorded at 15 frames per second (FPS) in five diverse and challenging environments: retail, lobby, industry, cafe, and office. This is by far the largest publicly available multi-camera multiple people tracking dataset.
5 PAPERS • 1 BENCHMARK
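Because MMPTRACK ships camera calibration parameters alongside the per-frame boxes, detections from different views can be associated by projecting a common 3D point into each camera. The pinhole-projection sketch below is a generic illustration of that step, not code from the dataset's toolkit, and the example calibration values are made up.

```python
import numpy as np

def project(point_3d, K, R, t):
    """Project a 3D world point into pixel coordinates with a pinhole camera.

    K: 3x3 intrinsic matrix, R: 3x3 rotation, t: 3-vector translation (world -> camera).
    """
    p_cam = R @ np.asarray(point_3d, dtype=float) + t
    p_img = K @ p_cam
    return p_img[:2] / p_img[2]          # (u, v) in pixels

# Example: the same world point seen by two calibrated cameras.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R1, t1 = np.eye(3), np.zeros(3)
R2, t2 = np.eye(3), np.array([-0.5, 0.0, 0.0])   # second camera shifted 0.5 m
person_head = np.array([0.0, 0.0, 5.0])           # point 5 m in front of camera 1
print(project(person_head, K, R1, t1), project(person_head, K, R2, t2))
```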
The dataset consists of 92 groups of video clips with 113,918 high-resolution frames taken by two drones and 63 groups of video clips with 145,875 high-resolution frames taken by three drones.
4 PAPERS • NO BENCHMARKS YET
Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure-from-motion techniques assume that static scene parts are observed alongside the deforming parts in order to establish an anchoring reference. However, this assumption does not hold in certain relevant applications such as endoscopy. To tackle this issue with a common benchmark, the Drunkard's Dataset is a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. It is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformation over time. Simulations in realistic 3D buildings yield a vast amount of data and ground-truth labels, including camera poses, RGB images and depth, optical flow and normal maps at high resolution and quality.
1 PAPER • 1 BENCHMARK
A dataset originally conceived for multi-face tracking/detection in highly crowded scenarios, where the face is the only part that can be used to track individuals.
1 PAPER • NO BENCHMARKS YET
MobiFace is the first dataset for single face tracking in mobile situations. It consists of 80 unedited live-streaming mobile videos captured by 70 different smartphone users in fully unconstrained environments. Over 95K bounding boxes are manually labelled. The videos are carefully selected to cover typical smartphone usage. The videos are also annotated with 14 attributes, including 6 newly proposed attributes and 8 commonly seen in object tracking.
This dataset contains nine video sequences captured by a webcam for salient closed boundary tracking evaluation. Each sequence is about 30 seconds long (30 fps) with a frame size of 640×480 (width×height); there are 9,598 frames in total. Each sequence includes different motion styles such as translation, rotation and viewpoint change.
SurgT is a dataset for benchmarking 2D Trackers in Minimally Invasive Surgery (MIS). It contains a total of 157 stereo endoscopic videos from 20 clinical cases, along with stereo camera calibration parameters.
The dataset is composed of 100 video sequences densely annotated with 60K bounding boxes, 17 sequence attributes, 13 action verb attributes and 29 target object attributes.
ARKitTrack is a new RGB-D tracking dataset for both static and dynamic scenes, captured with the consumer-grade LiDAR scanners on Apple's iPhone and iPad. It contains 300 RGB-D sequences, 455 targets, and 229.7K video frames in total, with 123.9K pixel-level target masks along with bounding-box annotations and frame-level attributes.
0 PAPER • NO BENCHMARKS YET
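Since ARKitTrack provides pixel-level masks alongside box annotations, either form can be derived from the other when only one is needed. The snippet below shows the standard way to obtain a tight bounding box from a binary mask; it is a generic utility sketch, not part of the dataset's release.

```python
import numpy as np

def mask_to_bbox(mask):
    """Tight (x, y, w, h) box around the non-zero pixels of a binary mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return (int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1))

# Example on a toy 6x6 mask containing a 3x2 object.
mask = np.zeros((6, 6), dtype=np.uint8)
mask[2:4, 1:4] = 1
print(mask_to_bbox(mask))   # (1, 2, 3, 2)
```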