The Densely Annotated Video Segmentation (DAVIS) dataset is a high-quality, high-resolution, densely annotated video segmentation dataset released at two resolutions, 480p and 1080p. It contains 50 video sequences with 3,455 frames annotated at the pixel level: 30 videos with 2,079 frames for training and 20 videos with 1,376 frames for validation.
634 PAPERS • 13 BENCHMARKS
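As an illustration of working with the DAVIS entry above, the following is a minimal Python sketch that pairs each frame with its pixel-level mask, assuming the standard DAVIS release layout (JPEGImages/480p/&lt;sequence&gt; alongside Annotations/480p/&lt;sequence&gt;); the root path and sequence name are placeholders, not part of the dataset description.

```python
from pathlib import Path

import numpy as np
from PIL import Image

# Assumed standard DAVIS release layout (480p); adjust the root and sequence as needed.
DAVIS_ROOT = Path("DAVIS")   # placeholder path
SEQUENCE = "bear"            # placeholder sequence name

frame_dir = DAVIS_ROOT / "JPEGImages" / "480p" / SEQUENCE
mask_dir = DAVIS_ROOT / "Annotations" / "480p" / SEQUENCE

for frame_path in sorted(frame_dir.glob("*.jpg")):
    mask_path = mask_dir / (frame_path.stem + ".png")
    frame = np.array(Image.open(frame_path))   # H x W x 3 RGB image
    mask = np.array(Image.open(mask_path))     # H x W palette-indexed mask
    object_ids = np.unique(mask)[1:]           # index 0 is treated as background
    print(frame_path.name, frame.shape, "objects:", object_ids.tolist())
```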
DAVIS17 is a dataset for video object segmentation. It contains a total of 150 videos: 60 for training, 30 for validation, and 60 for testing.
270 PAPERS • 11 BENCHMARKS
DAVIS16 is a dataset for video object segmentation consisting of 50 videos in total (30 for training and 20 for testing). Per-frame pixel-wise annotations are provided.
216 PAPERS • 4 BENCHMARKS
YouTube-VOS is a video object segmentation dataset containing 4,453 videos: 3,471 for training, 474 for validation, and 508 for testing. The training and validation videos have pixel-level ground-truth annotations for every 5th frame (6 fps), and instance segmentation annotations are also included. In total, the dataset covers more than 7,800 unique objects, 190k high-quality manual annotations, and more than 340 minutes of video.
174 PAPERS • 10 BENCHMARKS
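Because YouTube-VOS only annotates every 5th frame, a loader typically indexes the mask files and pairs each with its image. Below is a minimal sketch under the assumption of a YouTube-VOS-style tree with JPEGImages and Annotations subfolders per video; the folder names and the example video id are assumptions for illustration.

```python
from pathlib import Path

# Assumed YouTube-VOS-style layout: train/JPEGImages/<video_id>/*.jpg and
# train/Annotations/<video_id>/*.png, with masks present only for annotated frames.
ROOT = Path("YouTubeVOS/train")   # placeholder path

def annotated_pairs(video_id: str):
    """Yield (image_path, mask_path) for the frames that actually have masks."""
    image_dir = ROOT / "JPEGImages" / video_id
    mask_dir = ROOT / "Annotations" / video_id
    for mask_path in sorted(mask_dir.glob("*.png")):
        image_path = image_dir / (mask_path.stem + ".jpg")
        if image_path.exists():
            yield image_path, mask_path

# Example: count the annotated frames for one (hypothetical) video id.
pairs = list(annotated_pairs("0a2f2bd294"))
print(f"{len(pairs)} annotated frames")
```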
The Freiburg-Berkeley Motion Segmentation Dataset (FBMS-59) extends the BMS dataset with 33 additional video sequences. A total of 720 frames are annotated with pixel-accurate segmentation masks of moving objects. FBMS-59 comes with a split into a training set and a test set.
118 PAPERS • 3 BENCHMARKS
SegTrack v2 is a video segmentation dataset with full pixel-level annotations on multiple objects at each frame within each video.
102 PAPERS • 4 BENCHMARKS
The task is to localize and provide a pixel-level mask of an object in all video frames, given a language referring expression obtained either by looking at the first frame only or at the full video. The benchmark builds on two popular video object segmentation datasets, DAVIS16 [38] and DAVIS17 [42], which introduce various challenges: videos with single or multiple salient objects, crowded scenes, similar-looking instances, occlusions, camera view changes, fast motion, etc.
75 PAPERS • 5 BENCHMARKS
CoMplex video Object SEgmentation (MOSE) is a dataset for studying the tracking and segmentation of objects in complex environments. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. Its most notable feature is complex scenes with crowded and occluded objects.
20 PAPERS • 1 BENCHMARK
BL30K is a synthetic dataset rendered with Blender using ShapeNet data. The dataset is split into six segments, each with approximately 5K videos, organized in the same format as DAVIS and YouTube-VOS so that dataloaders for those datasets can be used directly. Each video is 160 frames long at a resolution of 768×512. There are 3–5 objects per video, each following a random smooth trajectory; the trajectories were optimized greedily to reduce object intersection (not guaranteed), and occlusions still occur frequently. See MiVOS for details.
11 PAPERS • NO BENCHMARKS YET
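Since the BL30K description above states that the videos follow the DAVIS/YouTube-VOS folder convention, an existing DAVIS-style reader can be pointed directly at a BL30K segment. The sketch below assumes JPEGImages and Annotations subfolders per video; the segment and video folder names are placeholders.

```python
from pathlib import Path

import numpy as np
from PIL import Image

def load_davis_style_video(root: Path, video: str):
    """Read all frames and masks of one video from a DAVIS/YouTube-VOS-style tree."""
    frames, masks = [], []
    for frame_path in sorted((root / "JPEGImages" / video).glob("*.jpg")):
        mask_path = root / "Annotations" / video / (frame_path.stem + ".png")
        frames.append(np.array(Image.open(frame_path)))
        masks.append(np.array(Image.open(mask_path)))
    return np.stack(frames), np.stack(masks)

# Because BL30K mirrors this layout, the same reader works on a BL30K segment
# (segment and video folder names below are placeholders).
frames, masks = load_davis_style_video(Path("BL30K/segment_a"), "video_00001")
print(frames.shape)   # expected (160, 512, 768, 3): 160 frames at 768x512
```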
AVSBench is a pixel-level audio-visual segmentation benchmark that provides ground-truth labels for sounding objects. The dataset is divided into three subsets: AVSBench-object (the Single-source and Multi-sources subsets) and AVSBench-semantic (the Semantic-labels subset). Accordingly, three settings are studied, one per subset: single sound source segmentation, multiple sound source segmentation, and audio-visual semantic segmentation.
10 PAPERS • NO BENCHMARKS YET
LVOS is a dataset for long-term video object segmentation (VOS). It consists of 220 videos with a total duration of 421 minutes. Videos in LVOS last 1.59 minutes on average, roughly 20 times longer than videos in existing VOS datasets. Each video is annotated with various attributes, especially challenges arising in the wild, such as long-term reappearing objects and cross-temporal similar objects.
9 PAPERS • NO BENCHMARKS YET
VOST consists of more than 700 high-resolution videos captured in diverse environments, which are 20 seconds long on average and densely labeled with instance masks. A careful, multi-step collection procedure ensures that these videos focus on complex object transformations and capture their full temporal extent.
4 PAPERS • NO BENCHMARKS YET
ARMBench is a large-scale, object-centric benchmark dataset for robotic manipulation in a warehouse setting. It contains images, videos, and metadata corresponding to 235K+ pick-and-place activities on 190K+ unique objects. The data is captured at different stages of manipulation: pre-pick, during transfer, and after placement.
3 PAPERS • NO BENCHMARKS YET
ODMS is a dataset for learning Object Depth via Motion and Segmentation. Its training data are configurable and extensible, with each training example consisting of a series of object segmentation masks, camera movement distances, and the ground-truth object depth. As a benchmark evaluation, the dataset provides four ODMS validation and test sets with 15,650 examples spanning multiple domains, including robotics and driving.
2 PAPERS • NO BENCHMARKS YET
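The depth-from-motion-and-segmentation setup described for ODMS above can be illustrated with simple pinhole geometry: a rigid object's projected mask area scales as 1/depth², so two masks observed before and after a known camera move along the optical axis determine the object's depth. The following is a worked sketch of that relation under those assumptions, not the benchmark's official baseline; the function and variable names are illustrative.

```python
import math

def object_depth_from_masks(area_before: float, area_after: float, camera_move: float) -> float:
    """
    Estimate the object's depth at the first observation.

    Assumes a pinhole camera moving straight toward a rigid object by
    `camera_move` (same units as the returned depth). Projected area scales
    as 1/depth^2, so sqrt(area_after / area_before) = z1 / (z1 - camera_move),
    which rearranges to z1 = ratio * camera_move / (ratio - 1).
    """
    ratio = math.sqrt(area_after / area_before)
    if ratio <= 1.0:
        raise ValueError("expected the object to appear larger after moving closer")
    return ratio * camera_move / (ratio - 1.0)

# Example: the mask grows from 1,000 to 4,000 pixels after moving 1 m closer,
# so the area ratio is 4 and the object started 2 m away.
print(object_depth_from_masks(1_000, 4_000, 1.0))   # 2.0
```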
PUMaVOS is a dataset of challenging and practical use cases inspired by the movie production industry.
1 PAPER • NO BENCHMARKS YET
VISOR is a dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, and it contains 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, and 67K hand-object relations, covering 36 hours of 179 untrimmed videos.
Infinity AI's Spills Basic Dataset is a synthetic, open-source dataset for safety applications. It features 150 videos of photorealistic liquid spills across 15 common settings. Spills take on in-context reflections, caustics, and depth based on the surrounding environment, lighting, and floor. Each video contains a spill with unique properties (size, color, profile, and more) and is accompanied by pixel-perfect labels and annotations. The dataset can be used to develop computer vision algorithms that detect the location and type of spill from the perspective of a fixed camera.
0 PAPERS • NO BENCHMARKS YET