HowTo100M is a large-scale dataset of narrated videos with an emphasis on instructional videos, in which content creators teach complex tasks with an explicit intention of explaining the visual content on screen. HowTo100M features 136M video clips with captions sourced from 1.2M narrated instructional YouTube videos, covering roughly 23k activities from domains such as cooking, hand crafting, personal care, gardening, and fitness.
241 PAPERS • 1 BENCHMARK
The TVQA dataset is a large-scale video dataset for video question answering. It is based on six popular TV shows (Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey's Anatomy, Castle) and includes 152,545 QA pairs from 21,793 TV show clips. The QA pairs are split 8:1:1 into training, validation, and test sets. For each clip, TVQA provides video frames extracted at 3 FPS, the corresponding subtitles, and a query consisting of a question and four answer candidates, only one of which is correct (see the sketch after this entry).
116 PAPERS • 3 BENCHMARKS
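For concreteness, a single TVQA item can be pictured as the record below. This is a minimal Python sketch with hypothetical field names and invented example values; the official release ships its own JSON schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TVQAExample:
    """One TVQA item: a clip, its subtitles, a question, and four answer candidates."""
    clip_id: str                  # identifier of the TV show clip
    frame_paths: List[str]        # frames extracted at 3 FPS
    subtitles: List[str]          # subtitle lines aligned with the clip
    question: str
    answer_candidates: List[str]  # exactly four candidates
    correct_idx: int              # index (0-3) of the single correct answer

# Hypothetical example illustrating the structure (not real dataset content):
example = TVQAExample(
    clip_id="friends_s01e01_seg02_clip_00",
    frame_paths=[f"frames/friends_s01e01_seg02_clip_00/{i:05d}.jpg" for i in range(90)],
    subtitles=["Ross: I just feel like someone reached down my throat..."],
    question="What is Ross holding when he talks to Joey?",
    answer_candidates=["A coffee mug", "An umbrella", "A pizza box", "A book"],
    correct_idx=0,
)
assert len(example.answer_candidates) == 4
```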
The ActivityNet-QA dataset contains 58,000 human-annotated QA pairs on 5,800 videos derived from the popular ActivityNet dataset. The dataset provides a benchmark for testing the performance of VideoQA models on long-term spatio-temporal reasoning.
87 PAPERS • 2 BENCHMARKS
The MovieQA dataset is a dataset for movie question answering, designed to evaluate automatic story comprehension from both video and text. It consists of almost 15,000 multiple-choice questions obtained from over 400 movies and features high semantic diversity. Each question comes with a set of five highly plausible answers, only one of which is correct. The questions can be answered using multiple sources of information: movie clips, plots, subtitles, and, for a subset, scripts and DVS.
79 PAPERS • 1 BENCHMARK
The TGIF-QA dataset contains 165K QA pairs for the animated GIFs from the TGIF dataset [Li et al. CVPR 2016]. The question & answer pairs are collected via crowdsourcing with a carefully designed user interface to ensure quality. The dataset can be used to evaluate video-based Visual Question Answering techniques.
77 PAPERS • 6 BENCHMARKS
NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal and temporal actions and understand the rich object interactions in daily activities. It supports both multi-choice and open-ended QA tasks. The videos are untrimmed and the questions usually invoke local video contents for answers.
74 PAPERS • 3 BENCHMARKS
The MSR-VTT-QA dataset is a benchmark for visual question answering (VQA) on videos from MSR-VTT (Microsoft Research Video to Text). It is used to evaluate models on their ability to answer questions about these videos; the underlying MSR-VTT videos are also used for video retrieval, video captioning, zero-shot video question answering, zero-shot video retrieval, and text-to-video generation.
55 PAPERS • 6 BENCHMARKS
TVQA+ contains 310.8K bounding boxes, linking depicted objects to visual concepts in questions and answers.
46 PAPERS • NO BENCHMARKS YET
The Tumblr GIF (TGIF) dataset contains 100K animated GIFs and 120K sentences describing the visual content of the animated GIFs. The animated GIFs were collected from Tumblr, from randomly selected posts published between May and June of 2015. The dataset provides the URLs of the animated GIFs. The sentences were collected via crowdsourcing, with a carefully designed annotation interface that ensures a high-quality dataset. There is one sentence per animated GIF for the training and validation splits, and three sentences per GIF for the test split. The dataset can be used to evaluate animated GIF/video description techniques.
45 PAPERS • 1 BENCHMARK
VALUE is a Video-And-Language Understanding Evaluation benchmark to test models that are generalizable to diverse tasks, domains, and datasets. It is an assemblage of 11 VidL (video-and-language) datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. Rather than focusing on single-channel videos with visual information only, VALUE promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks.
25 PAPERS • NO BENCHMARKS YET
To collect How2QA for the video QA task, the same set of selected video clips was presented to another group of AMT workers for multiple-choice QA annotation. Each worker was assigned one video segment and asked to write one question with four answer candidates (one correct and three distractors). Narrations were hidden from the workers to ensure the collected QA pairs were not biased by subtitles. As in TVQA, the start and end points of the relevant moment are provided for each question. After filtering low-quality annotations, the final dataset contains 44,007 QA pairs for 22k 60-second clips selected from 9,035 videos.
22 PAPERS • 2 BENCHMARKS
An open-ended VideoQA benchmark that aims to: (i) provide a well-defined evaluation by including five correct answer annotations per question, and (ii) avoid questions that can be answered without watching the video.
Action Genome Question Answering (AGQA) is a benchmark for compositional spatio-temporal reasoning. AGQA contains 192M unbalanced question answer pairs for 9.6K videos. It also contains a balanced subset of 3.9M question answer pairs, 3 orders of magnitude larger than existing benchmarks, that minimizes bias by balancing the answer distributions and types of question structures.
20 PAPERS • NO BENCHMARKS YET
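The balancing described for AGQA (keeping the answer distribution from being dominated by a few frequent answers) can be illustrated with a simplified sketch. This is not the authors' actual procedure, just a generic down-sampling pass that caps how many pairs may share the same answer.

```python
import random
from collections import defaultdict

def balance_by_answer(qa_pairs, max_per_answer, seed=0):
    """Down-sample QA pairs so no single answer dominates the distribution.

    qa_pairs: list of (question, answer) tuples.
    max_per_answer: cap on how many pairs may share the same answer.
    """
    rng = random.Random(seed)
    by_answer = defaultdict(list)
    for q, a in qa_pairs:
        by_answer[a].append((q, a))
    balanced = []
    for a, group in by_answer.items():
        rng.shuffle(group)
        balanced.extend(group[:max_per_answer])
    rng.shuffle(balanced)
    return balanced

# Toy usage: "yes" is heavily over-represented before balancing.
pairs = [("Did the person open the door?", "yes")] * 8 + \
        [("What did the person hold?", "cup")] * 2
print(len(balance_by_answer(pairs, max_per_answer=3)))  # 5
```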
The Video Instruction Dataset is used to train Video-ChatGPT. It consists of 100,000 high-quality video instruction pairs created with a combination of human-assisted and semi-automatic annotation techniques, aiming to produce high-quality video instruction data. These methods generate question-answer pairs grounded in the content of the videos.
17 PAPERS • 6 BENCHMARKS
SUTD-TrafficQA (Singapore University of Technology and Design - Traffic Question Answering) is a video QA dataset based on 10,080 in-the-wild videos with 62,535 annotated QA pairs, for benchmarking the causal-inference and event-understanding capabilities of models in complex traffic scenarios. The dataset defines six challenging reasoning tasks corresponding to various traffic scenarios, so as to evaluate reasoning over different kinds of complex yet practical traffic events.
16 PAPERS • 1 BENCHMARK
MVBench is a comprehensive Multi-modal Video understanding Benchmark. It was introduced to evaluate the comprehension capabilities of Multi-modal Large Language Models (MLLMs), particularly their temporal understanding in dynamic video tasks. MVBench covers 20 challenging video tasks that cannot be effectively solved with a single frame. It introduces a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, it enables the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition.
15 PAPERS • 1 BENCHMARK
The MSRVTT-MC (Multiple Choice) dataset is a video question-answering dataset created based on the MSR-VTT dataset. It consists of 2,990 questions generated from 10,000 video clips with associated ground truth captions. For each question, there are five candidate captions, including the ground truth caption and four randomly sampled negative choices. The objective of the dataset is to choose the correct answer from the five candidate captions.
14 PAPERS • 2 BENCHMARKS
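A common way to score MSRVTT-MC is to rank the five candidate captions with a video-text similarity model and count how often the ground-truth caption ranks first. The sketch below assumes a user-supplied score_fn; the dataset itself does not prescribe a particular model.

```python
from typing import Callable, List, Sequence

def multiple_choice_accuracy(
    videos: Sequence[str],
    candidates: Sequence[List[str]],
    gt_indices: Sequence[int],
    score_fn: Callable[[str, str], float],
) -> float:
    """Accuracy of picking the ground-truth caption out of the candidates.

    score_fn(video, caption) is any video-text similarity function; it is an
    assumption of this sketch, not part of the dataset.
    """
    correct = 0
    for video, caps, gt in zip(videos, candidates, gt_indices):
        scores = [score_fn(video, c) for c in caps]
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == gt)
    return correct / len(videos)

# Toy usage with a dummy similarity function that simply prefers longer captions:
dummy_score = lambda video, caption: float(len(caption))
acc = multiple_choice_accuracy(
    videos=["video0001"],
    candidates=[["a man cooks pasta in a kitchen", "a dog", "a car", "a tree", "rain"]],
    gt_indices=[0],
    score_fn=dummy_score,
)
print(acc)  # 1.0
```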
Capturing knowledge from the present, surrounding situation and reasoning accordingly is crucial and challenging for machine intelligence. The STAR Benchmark is a benchmark for situated reasoning, which provides 60K challenging situated questions across four types of tasks, 140K situated hypergraphs, symbolic situation programs, and logic-grounded diagnosis for real-world video situations.
14 PAPERS • 3 BENCHMARKS
DramaQA focuses on two perspectives: 1) hierarchical QAs as an evaluation metric based on the cognitive developmental stages of human intelligence, and 2) character-centered video annotations to model local coherence of the story. The dataset is built upon the TV drama "Another Miss Oh" and contains 17,983 QA pairs from 23,928 video clips of various lengths, with each QA pair belonging to one of four difficulty levels.
10 PAPERS • 3 BENCHMARKS
IntentQA is a dataset with diverse intents in daily social activities.
10 PAPERS • 1 BENCHMARK
VLEP contains 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV show and YouTube Lifestyle Vlog video clips. Each example consists of a Premise Event (a short video clip with dialogue), a Premise Summary (a text summary of the premise event), and two potential natural-language Future Events (along with rationales) written by people. The clips are on average 6.1 seconds long and are harvested from these diverse, event-rich sources.
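A VLEP example can be pictured as the record below; the field names are illustrative assumptions, not the official annotation keys.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VLEPExample:
    """One future-event-prediction item, as described above."""
    premise_clip: str         # path/ID of the ~6-second premise video clip
    premise_dialogue: str     # dialogue spoken in the premise clip
    premise_summary: str      # short text summary of the premise event
    future_events: List[str]  # two candidate natural-language future events
    rationales: List[str]     # human-written rationale for each future event
    more_likely_idx: int      # which future event is more likely (0 or 1)
```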
KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual, and temporal coherence reasoning with knowledge-based questions, which require experience gained from watching the series in order to be answered.
8 PAPERS • 1 BENCHMARK
The EgoTaskQA benchmark contains 40K balanced question-answer pairs selected from 368K programmatically generated questions over 2K egocentric videos. It provides a single home for the crucial dimensions of task understanding through question answering on real-world egocentric videos.
7 PAPERS • 1 BENCHMARK
A fill-in-the-blank video question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired, intended as a quantitative benchmark for developing and understanding video models.
6 PAPERS • NO BENCHMARKS YET
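The general idea of a fill-in-the-blank example can be illustrated by blanking out one word of a descriptive annotation, as in the simplified sketch below. This is only an illustration under that assumption; the actual dataset uses its own rules for selecting which word to remove.

```python
import random

def make_fill_in_the_blank(description: str, rng: random.Random, blank_token: str = "_____"):
    """Turn a descriptive video annotation into a (question, answer) pair
    by replacing one randomly chosen word with a blank token."""
    words = description.split()
    idx = rng.randrange(len(words))
    answer = words[idx]
    words[idx] = blank_token
    return " ".join(words), answer

rng = random.Random(0)
question, answer = make_fill_in_the_blank("A man pours coffee into a red mug", rng)
print(question, "->", answer)
```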
A dataset of 69,270,581 (video clip, question, answer) triplets (v, q, a). HowToVQA69M is two orders of magnitude larger than any previously available VideoQA dataset.
5 PAPERS • NO BENCHMARKS YET
TutorialVQA is a new type of dataset used to find answer spans in tutorial videos. The dataset includes about 6,000 triples, comprised of videos, questions, and answer spans, manually collected from screencast tutorial videos with spoken narratives for photo-editing software.
Perception Test is a benchmark designed to evaluate the perception and reasoning skills of multimodal models. It introduces real-world videos designed to show perceptually interesting situations and defines multiple tasks that require understanding of memory, abstract patterns, physics, and semantics, across visual, audio, and text modalities. The benchmark consists of 11.6k videos (23 s average length) filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels: object and point tracks, temporal action and sound segments, multiple-choice video question-answers, and grounded video question-answers. The benchmark probes pre-trained models for their transfer capabilities in a zero-shot, few-shot, or fine-tuning regime.
4 PAPERS • NO BENCHMARKS YET
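The six annotation types listed for the Perception Test could be gathered in a per-video container along the lines of the sketch below; the field names and types are assumptions for illustration, not the released schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Segment = Tuple[float, float, str]       # (start_s, end_s, label)

# Hypothetical per-video container mirroring the six annotation types above.
@dataclass
class PerceptionTestVideo:
    video_id: str
    object_tracks: Dict[str, List[Box]] = field(default_factory=dict)                 # per-object boxes over time
    point_tracks: Dict[str, List[Tuple[float, float]]] = field(default_factory=dict)  # tracked (x, y) points over time
    action_segments: List[Segment] = field(default_factory=list)                      # temporal action segments
    sound_segments: List[Segment] = field(default_factory=list)                       # temporal sound segments
    mc_video_qa: List[dict] = field(default_factory=list)                             # multiple-choice video QA
    grounded_video_qa: List[dict] = field(default_factory=list)                       # grounded video QA
```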
Video Localized Narratives is a new form of multimodal video annotations connecting vision and language. The annotations are created from videos with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects. It contains annotations of 20k videos of the OVIS, UVO, and Oops datasets, totalling 1.7M words.
CRIPP-VQA is a video question answering dataset for reasoning about the implicit physical properties of objects in a scene. It contains videos of objects in motion, annotated with questions that involve counterfactual reasoning about actions, questions about planning to reach a goal, and descriptive questions about visible properties of objects.
2 PAPERS • 8 BENCHMARKS
WildQA is a video understanding dataset of videos recorded in outdoor settings. The dataset can be used to evaluate models for video question answering.
1 PAPER • 1 BENCHMARK
This is a fine-grained video-text dataset with 12K annotated high-resolution videos (~400K clips). The annotation of this dataset is inspired by video scripts: before shooting a video, one first writes a script that organizes how each scene will be shot, deciding its content, shot type (medium shot, close-up, etc.), and how the camera moves (panning, tilting, etc.). The dataset therefore extends video captioning to video scripting by annotating videos in the format of video scripts. Unlike previous video-text datasets, the entire video is densely annotated without discarding any scenes, and each scene has a caption of ~145 words. Besides the vision modality, the voice-over is transcribed into text and provided along with the video title to give more background information for annotating the videos.
0 PAPERS • NO BENCHMARKS YET
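The script-style annotation described for this fine-grained video-text dataset could be represented roughly as follows; the record layout and field names are assumptions for illustration, not the dataset's released format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SceneAnnotation:
    """One annotated scene of a video, in script style."""
    caption: str          # dense description of the scene (~145 words)
    shot_type: str        # e.g. "medium shot", "close-up"
    camera_movement: str  # e.g. "panning", "tilting", "static"

@dataclass
class VideoScript:
    """Script-style annotation of a whole video: every scene is kept."""
    video_title: str
    voice_over_transcript: str     # transcribed voice-over for background context
    scenes: List[SceneAnnotation]  # one entry per scene, none discarded
```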