The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. It consists of around 500,000 video clips covering 600 human action classes, with at least 600 clips per class. Each clip lasts around 10 seconds and is labeled with a single action class. The videos are collected from YouTube.
1,180 PAPERS • 28 BENCHMARKS
The ActivityNet dataset contains 200 different types of activities and a total of 849 hours of video collected from YouTube. ActivityNet is the largest benchmark for temporal activity detection to date in terms of both the number of activity categories and the number of videos, making the task particularly challenging. Version 1.3 of the dataset contains 19,994 untrimmed videos in total, divided into three disjoint subsets (training, validation, and testing) with a 2:1:1 ratio. On average, each activity category has 137 untrimmed videos, and each video has 1.41 activities annotated with temporal boundaries. The ground-truth annotations of the test videos are not public.
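As a rough illustration only, the sketch below computes the approximate split sizes implied by the stated 2:1:1 ratio; the official ActivityNet v1.3 release may round the counts slightly differently.

```python
# Illustrative arithmetic only: approximate ActivityNet v1.3 split sizes
# implied by the stated 2:1:1 ratio over 19,994 untrimmed videos.
# The official release may round the counts slightly differently.
total_videos = 19_994
ratio = (2, 1, 1)

parts = sum(ratio)
train = total_videos * ratio[0] // parts
val = total_videos * ratio[1] // parts
test = total_videos - train - val

print(train, val, test)  # -> 9997 4998 4999 (approximate, not official counts)
```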
688 PAPERS • 18 BENCHMARKS
MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for open-domain video captioning, consisting of 10,000 video clips from 20 categories; each clip is annotated with 20 English sentences by Amazon Mechanical Turk workers. There are about 29,000 unique words across all captions. The standard split uses 6,513 clips for training, 497 for validation, and 2,990 for testing.
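A minimal consistency check of the quoted MSR-VTT statistics, assuming exactly 20 sentences per clip as stated above:

```python
# Quick consistency check of the MSR-VTT numbers quoted above: the standard
# split sizes sum to the 10,000 clips, and with 20 sentences per clip the
# corpus holds 200,000 captions in total.
splits = {"train": 6_513, "val": 497, "test": 2_990}
sentences_per_clip = 20

total_clips = sum(splits.values())
total_captions = total_clips * sentences_per_clip

assert total_clips == 10_000
print(total_clips, total_captions)  # -> 10000 200000
```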
523 PAPERS • 7 BENCHMARKS
The Microsoft Research Video Description Corpus (MSVD) dataset consists of about 120K sentences collected during the summer of 2010. Workers on Mechanical Turk were paid to watch a short video snippet and then summarize the action in a single sentence. The result is a set of roughly parallel descriptions of more than 2,000 video snippets. Because the workers were urged to complete the task in the language of their choice, both paraphrase and bilingual alternations are captured in the data.
281 PAPERS • 3 BENCHMARKS
HowTo100M is a large-scale dataset of narrated videos with an emphasis on instructional videos, where content creators teach complex tasks with the explicit intention of explaining the visual content on screen. HowTo100M features a total of 136 million video clips with captions, sourced from 1.2 million narrated instructional YouTube videos covering around 23,000 visual tasks from domains such as cooking, hand crafting, personal care, gardening, and fitness.
241 PAPERS • 1 BENCHMARK
Charades-STA is a dataset built on top of Charades by adding sentence-level temporal annotations.
184 PAPERS • 4 BENCHMARKS
The Distinct Describable Moments (DiDeMo) dataset is one of the largest and most diverse datasets for the temporal localization of events in videos given natural language descriptions. The videos are collected from Flickr and each video is trimmed to a maximum of 30 seconds. The videos are divided into 5-second segments to reduce the complexity of annotation. The dataset is split into training, validation, and test sets containing 8,395, 1,065, and 1,004 videos, respectively. It contains a total of 26,892 moments, and a single moment can be associated with descriptions from multiple annotators. The descriptions in DiDeMo are detailed, often referring to camera movement, temporal transition indicators, and activities, and they are verified so that each description refers to a single moment.
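A minimal sketch, in Python, of how a DiDeMo-style moment could be represented given the 5-second segment scheme described above; the field names are illustrative assumptions, not the official release format.

```python
# A minimal, hypothetical record layout for a DiDeMo-style moment annotation.
# Field names are illustrative, not the official release format: moments are
# expressed as a span of 5-second segments, and one moment can carry
# descriptions from several annotators.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DiDeMoMoment:
    video_id: str
    start_segment: int          # index of the first 5-second segment
    end_segment: int            # index of the last 5-second segment (inclusive)
    descriptions: List[str] = field(default_factory=list)

    @property
    def start_seconds(self) -> float:
        return self.start_segment * 5.0

    @property
    def end_seconds(self) -> float:
        return (self.end_segment + 1) * 5.0

moment = DiDeMoMoment("flickr_000123", 1, 2, ["the camera pans to a dog running"])
print(moment.start_seconds, moment.end_seconds)  # -> 5.0 15.0
```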
182 PAPERS • 3 BENCHMARKS
WebVid contains 10 million video clips with captions, sourced from the web. The videos are diverse and rich in their content.
172 PAPERS • 1 BENCHMARK
YouCook2 is the largest task-oriented, instructional video dataset in the vision community. It contains 2,000 long untrimmed videos from 89 cooking recipes; on average, each distinct recipe has 22 videos. The procedure steps for each video are annotated with temporal boundaries and described by imperative English sentences. The videos were downloaded from YouTube and are all in the third-person viewpoint. The videos are unconstrained, typically recorded by individuals in their own homes with unfixed cameras. YouCook2 contains rich recipe types and various cooking styles from all over the world.
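A hypothetical sketch of how a YouCook2-style procedure annotation could be structured, with temporal boundaries and an imperative sentence per step; field names are assumptions, not the official schema.

```python
# Hypothetical representation of a YouCook2-style annotation: each untrimmed
# video belongs to a recipe type and carries an ordered list of steps, each
# with temporal boundaries (in seconds) and an imperative English sentence.
# Field names are illustrative, not the official schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RecipeStep:
    segment: Tuple[float, float]   # (start_sec, end_sec) temporal boundary
    sentence: str                  # imperative description of the step

@dataclass
class YouCook2Video:
    video_id: str
    recipe_type: str
    steps: List[RecipeStep]

video = YouCook2Video(
    video_id="yt_abc123",
    recipe_type="grilled cheese",
    steps=[
        RecipeStep((12.0, 45.5), "butter two slices of bread"),
        RecipeStep((46.0, 90.0), "place cheese between the slices and grill"),
    ],
)
print(len(video.steps))  # -> 2
```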
158 PAPERS • 7 BENCHMARKS
This dataset contains 118,081 short video clips extracted from 202 movies. Each clip has a caption, either extracted from the movie script or from transcribed DVS (descriptive video services) for the visually impaired. The validation set contains 7,408 clips, and evaluation is performed on a test set of 1,000 clips from movies disjoint from the training and validation sets.
113 PAPERS • 4 BENCHMARKS
VATEX is a multilingual, large, linguistically complex, and diverse dataset in terms of both videos and natural language descriptions. It supports two tasks for video-and-language research: (1) multilingual video captioning, which aims to describe a video in various languages with a compact unified captioning model, and (2) video-guided machine translation, which translates a source-language description into the target language using the video as additional spatiotemporal context.
95 PAPERS • 3 BENCHMARKS
The Tumblr GIF (TGIF) dataset contains 100K animated GIFs and 120K sentences describing the visual content of the animated GIFs. The GIFs were collected from Tumblr, from randomly selected posts published between May and June of 2015, and the dataset provides their URLs. The sentences were collected via crowdsourcing, with a carefully designed annotation interface that ensures a high-quality dataset. There is one sentence per animated GIF for the training and validation splits, and three sentences per GIF for the test split. The dataset can be used to evaluate animated GIF/video description techniques.
45 PAPERS • 1 BENCHMARK
TVR is a multimodal retrieval dataset that requires systems to understand both videos and their associated subtitle (dialogue) texts, making the task more realistic. The dataset contains 109K queries collected on 21.8K videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal window.
27 PAPERS • 3 BENCHMARKS
To collect How2QA for the video QA task, the same set of selected video clips is presented to another group of AMT workers for multiple-choice QA annotation. Each worker is assigned one video segment and asked to write one question with four answer candidates (one correct and three distractors). Similarly, narrations are hidden from the workers to ensure that the collected QA pairs are not biased by subtitles. Similar to TVQA, the start and end points of the relevant moment are provided for each question. After filtering low-quality annotations, the final dataset contains 44,007 QA pairs for 22k 60-second clips selected from 9,035 videos.
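A hypothetical record layout for a How2QA-style annotation as described above (one question, four candidates with one correct, plus the relevant moment's start and end); the names are illustrative, not the official format.

```python
# Hypothetical How2QA-style multiple-choice item: one question, four answer
# candidates (one correct, three distractors), and the start/end of the
# relevant moment within the 60-second clip. Field names are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class How2QAItem:
    clip_id: str
    question: str
    candidates: List[str]      # exactly four options
    correct_index: int         # position of the correct answer in `candidates`
    start_sec: float           # relevant moment start
    end_sec: float             # relevant moment end

item = How2QAItem(
    clip_id="vid_0001_clip_03",
    question="What tool does the person pick up first?",
    candidates=["a hammer", "a screwdriver", "a wrench", "a drill"],
    correct_index=1,
    start_sec=12.5,
    end_sec=21.0,
)
assert len(item.candidates) == 4
```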
22 PAPERS • 2 BENCHMARKS
Video-and-Language Inference is the task of joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. The Violin dataset is a dataset for this task which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and people interactions, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels.
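A minimal, hypothetical encoding of a Violin example under the task definition above; the field and label names are assumptions rather than the official release format.

```python
# Hypothetical encoding of a Violin example: the premise is a video clip with
# aligned subtitles, the hypothesis is a natural-language statement, and the
# label says whether the hypothesis is entailed or contradicted by the clip.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ViolinExample:
    clip_id: str
    subtitles: List[Tuple[float, float, str]]  # (start_sec, end_sec, text)
    hypothesis: str
    label: str                                 # "entailed" or "contradicted"

example = ViolinExample(
    clip_id="tvshow_s01e02_clip17",
    subtitles=[(0.0, 2.4, "I told you not to open that door.")],
    hypothesis="Someone is being warned about a door.",
    label="entailed",
)
assert example.label in {"entailed", "contradicted"}
```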
17 PAPERS • NO BENCHMARKS YET
The FIVR-200K dataset has been collected to simulate the problem of Fine-grained Incident Video Retrieval (FIVR). The dataset comprises 225,960 videos associated with 4,687 Wikipedia events and 100 selected video queries.
15 PAPERS • 1 BENCHMARK
TVC (TV show Captions) is a large-scale multimodal captioning dataset, containing 261,490 caption descriptions paired with 108,965 short video moments. TVC is unique in that its captions may also describe the dialogue/subtitles, whereas captions in other datasets describe only the visual content.
Consists of the key scenes from over 3K movies: each key scene is accompanied by a high-level semantic description of the scene, character face-tracks, and metadata about the movie. The dataset is scalable, obtained automatically from YouTube, and freely available for anybody to download and use.
11 PAPERS • NO BENCHMARKS YET
A large-scale video dataset, featuring clips from movies with detailed captions.
11 PAPERS • 1 BENCHMARK
A large-scale dataset for retrieval and event localisation in video. A unique feature of the dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content.
Amazon Mechanical Turk (AMT) is used to collect annotations on HowTo100M videos. 30k 60-second clips are randomly sampled from 9,421 videos and presented to the workers, who are asked to select a video segment containing a single, self-contained scene. After this segment-selection step, another group of workers write descriptions for each selected segment. Narrations are not shown to the workers, so their queries are based on visual content only. The final video segments are 10-20 seconds long on average, and query length ranges from 8 to 20 words. This process yields 51,390 queries for 24k 60-second clips from 9,371 HowTo100M videos, on average 2-3 queries per clip. The video clips and their associated queries are split into 80% train, 10% validation, and 10% test.
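A rough sketch of the resulting annotation unit and the approximate 80/10/10 split over the 24k clips; the field names and exact split counts are assumptions, not the official release.

```python
# Illustrative sketch only: a query record for the annotation scheme described
# above (a free-form query grounded in a 10-20 second segment of a 60-second
# clip), plus the approximate 80/10/10 split over the 24k clips.
from dataclasses import dataclass

@dataclass
class GroundedQuery:
    clip_id: str
    query: str          # 8-20 words, written from visual content only
    start_sec: float    # segment start within the 60-second clip
    end_sec: float      # segment end (segments average 10-20 seconds)

n_clips = 24_000
train = int(n_clips * 0.8)
val = int(n_clips * 0.1)
test = n_clips - train - val
print(train, val, test)  # -> 19200 2400 2400 (approximate)
```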
9 PAPERS • NO BENCHMARKS YET
The first shard of the Marine Video Kit dataset contains single-shot videos taken from moving cameras in underwater environments, and is intended to serve video retrieval and other computer vision challenges. In addition to basic metadata statistics, the authors present several insights based on low-level features as well as semantic annotations of selected keyframes. The shard comprises 1,379 videos ranging from 2 s to 4.95 min in length, with mean and median durations of 29.9 s and 25.4 s, respectively. The data were captured in 11 different regions and countries between 2011 and 2022.
7 PAPERS • 1 BENCHMARK
HiREST (HIerarchical REtrieval and STep-captioning) is a benchmark covering hierarchical information retrieval and visual/textual stepwise summarization over an instructional video corpus. It consists of 3.4K text-video pairs, of which 1.1K videos are annotated with the moment span relevant to the text query and a breakdown of each moment into key instruction steps with captions and timestamps (8.6K step captions in total). The benchmark comprises four tasks: video retrieval, moment retrieval, and the two novel tasks of moment segmentation and step captioning.
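A hypothetical sketch of HiREST's hierarchy (query, relevant videos, moment span, step captions with timestamps); the field names are illustrative, not the official annotation format.

```python
# Hypothetical view of the HiREST hierarchy described above: a text query maps
# to relevant videos, a video carries a moment span, and the moment breaks
# down into key instruction steps with captions and timestamps.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class StepCaption:
    caption: str
    span: Tuple[float, float]          # (start_sec, end_sec) of the step

@dataclass
class AnnotatedVideo:
    video_id: str
    moment_span: Optional[Tuple[float, float]]  # None for videos without moment labels
    steps: List[StepCaption]

@dataclass
class HiRESTQuery:
    query: str
    videos: List[AnnotatedVideo]

item = HiRESTQuery(
    query="how to repot a houseplant",
    videos=[AnnotatedVideo("yt_xyz", (30.0, 210.0),
                           [StepCaption("remove the plant from its old pot", (30.0, 55.0))])],
)
print(len(item.videos))  # -> 1
```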
2 PAPERS • NO BENCHMARKS YET
VTC is a large-scale multimodal dataset containing around 300K video-caption pairs alongside user comments, and can be used for multimodal representation learning.
The IAW dataset contains 420 IKEA furniture pieces from 14 common categories, e.g. sofa, bed, wardrobe, and table. Each piece of furniture comes with one or more user instruction manuals, which are first divided into pages and then further divided into independent steps cropped from each page (some pages contain more than one step and some pages contain no instructions). There are 8,568 pages and 8,263 steps overall, on average 20.4 pages and 19.7 steps per piece of furniture. The authors crawled YouTube to find videos corresponding to these instruction manuals, so the conditions in the videos are diverse in many respects, e.g. duration, resolution, first- or third-person view, camera pose, background environment, and number of assemblers. The dataset contains 1,005 raw videos totaling around 183 hours. Of this, approximately 114 hours of content are labeled with 15,649 actions, each matched to the corresponding step in the corresponding manual.
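One possible (assumed) way to model the IAW video-to-manual alignment described above, with a small check that reproduces the quoted per-piece averages:

```python
# Illustrative only: each labeled video action points back to a specific step
# of a specific manual. Names are assumptions; the arithmetic at the end just
# reproduces the averages quoted above.
from dataclasses import dataclass

@dataclass
class ManualStep:
    furniture_id: str
    manual_id: str
    page_index: int
    step_index: int

@dataclass
class VideoAction:
    video_id: str
    start_sec: float
    end_sec: float
    step: ManualStep      # the instruction step this action is matched to

pieces, pages, steps = 420, 8568, 8263
print(round(pages / pieces, 1), round(steps / pieces, 1))  # -> 20.4 19.7
```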
1 PAPER • NO BENCHMARKS YET
Kinetics-GEB+ (Generic Event Boundary Captioning, Grounding and Retrieval) is a dataset that consists of over 170k boundaries associated with captions describing status changes in the generic events in 12K videos.
1 PAPER • 3 BENCHMARKS
MSVD-Indonesian is derived from the MSVD dataset with the help of a machine translation service. It can be used for multimodal video-text tasks, including text-to-video retrieval, video-to-text retrieval, and video captioning. Like the original English dataset, MSVD-Indonesian contains about 80K video-text pairs.
1 PAPER • 4 BENCHMARKS
STVD is the largest public dataset for the PVCD task. It comprises about 83 thousand videos totaling more than 10 thousand hours, with more than 420 thousand video copy pairs. It offers different test sets for fine-grained performance characterization (frame degradation, global transformation, video speed change, etc.), with frame-level annotations for real-time detection and video alignment. Baseline comparisons are reported and show room for improvement. More information about the STVD dataset can be found in the publications [1, 2].
1 PAPER • 1 BENCHMARK
A short video clip may contain a progression of multiple events and an interesting storyline. A human needs to capture both the event in every shot and the associations between events to understand the story behind the clip.
VILT is a benchmark collection of tasks and multimodal video content. The video linking collection includes annotations for 10 (recipe) tasks, which the annotators chose from a random subset of a collection of 2,275 high-quality 'Wholefoods' recipes. There are linking annotations for 61 query steps containing cooking techniques across these tasks, chosen from the 189 total recipe steps. As each method results in approximately 10 videos to annotate, the collection consists of 831 linking judgments.
A first-of-its-kind paired win-fail action understanding dataset with samples from the following domains: “General Stunts,” “Internet Wins-Fails,” “Trick Shots,” and “Party Games.” The task is to identify successful and failed attempts at various activities. Unlike existing action recognition datasets, intra-class variation is high, making the task challenging yet feasible.
1 PAPER • 2 BENCHMARKS