The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 video clips covering 600 human action classes with at least 600 video clips for each action class. Each video clip lasts around 10 seconds and is labeled with a single action class. The videos are collected from YouTube.
1,180 PAPERS • 28 BENCHMARKS
The Charades dataset is composed of 9,848 videos of daily indoor activities with an average length of 30 seconds, involving interactions with 46 object classes in 15 types of indoor scenes and containing a vocabulary of 30 verbs leading to 157 action classes. Each video in this dataset is annotated with multiple free-text descriptions, action labels, action intervals, and classes of interacting objects. 267 different users were presented with a sentence, which includes objects and actions from a fixed vocabulary, and they recorded a video acting out the sentence. In total, the dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos. In the standard split there are 7,986 training videos and 1,863 validation videos.
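For readers working with the temporal annotations, a minimal parsing sketch is shown below. It assumes the layout of the official Charades_v1_train.csv release, in which the actions column holds semicolon-separated "class start end" triples; treat the column names and format as assumptions to verify against your copy of the annotations.

```python
# Minimal sketch (assumed Charades_v1_train.csv layout, where the "actions"
# column stores semicolon-separated "classID start end" triples,
# e.g. "c092 11.90 21.20;c147 0.00 12.60").
import csv

def load_charades_intervals(csv_path):
    """Map each video id to its list of (action_class, start_sec, end_sec)."""
    intervals = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            actions = []
            for triple in filter(None, row["actions"].split(";")):
                cls, start, end = triple.split()
                actions.append((cls, float(start), float(end)))
            intervals[row["id"]] = actions
    return intervals

# Usage: intervals = load_charades_intervals("Charades_v1_train.csv")
```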
383 PAPERS • 6 BENCHMARKS
Charades-STA is a new dataset built on top of Charades by adding sentence temporal annotations.
184 PAPERS • 4 BENCHMARKS
The ShanghaiTech Campus dataset has 13 scenes with complex light conditions and camera angles. It contains 130 abnormal events and over 270,000 training frames. Moreover, both the frame-level and pixel-level ground truth of abnormal events are annotated in this dataset.
164 PAPERS • 4 BENCHMARKS
AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. Each of the video clips has been exhaustively annotated by human annotators, and together they represent a rich variety of scenes, recording conditions, and expressions of human activity. Annotations cover atomic visual actions (AVA Actions, AVA-Kinetics) as well as spoken activity (AVA ActiveSpeaker, AVA Speech).
94 PAPERS • 7 BENCHMARKS
A novel large-scale corpus of manual annotations for the SoccerNet video dataset, along with open challenges to encourage more research in soccer understanding and broadcast production.
45 PAPERS • 7 BENCHMARKS
MovieNet is a holistic dataset for movie understanding. MovieNet contains 1,100 movies with a large amount of multi-modal data, e.g. trailers, photos, plot descriptions, etc. In addition, several types of manual annotations are provided in MovieNet, including 1.1M characters with bounding boxes and identities, 42K scene boundaries, 2.5K aligned description sentences, 65K tags of place and action, and 92K tags of cinematic style.
41 PAPERS • NO BENCHMARKS YET
The EPIC-KITCHENS-55 dataset comprises a set of 432 egocentric videos recorded by 32 participants in their kitchens at 60fps with a head-mounted camera. There is no guiding script; participants freely perform activities in their kitchens related to cooking, food preparation, washing up, and more. Each video is split into short action segments (mean duration 3.7s) with specific start and end times and a verb and noun annotation describing the action (e.g. ‘open fridge‘). There are 125 verb classes and 331 noun classes. The dataset is divided into one train and two test splits.
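A minimal sketch of how one might load the action-segment annotations follows; the CSV column names (video_id, start_timestamp, stop_timestamp, verb_class, noun_class) are assumptions and should be checked against the official annotation files.

```python
# Minimal sketch; column names are assumed, verify against the released CSVs.
import csv
from dataclasses import dataclass

@dataclass
class ActionSegment:
    video_id: str
    start: str          # "HH:MM:SS.ff" timestamp
    stop: str
    verb_class: int     # one of the 125 verb classes
    noun_class: int     # one of the 331 noun classes

def load_segments(csv_path):
    with open(csv_path, newline="") as f:
        return [
            ActionSegment(r["video_id"], r["start_timestamp"], r["stop_timestamp"],
                          int(r["verb_class"]), int(r["noun_class"]))
            for r in csv.DictReader(f)
        ]
```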
35 PAPERS • 3 BENCHMARKS
A new multitask action quality assessment (AQA) dataset, the largest to date, comprising more than 1,600 diving samples; it contains detailed annotations for fine-grained action recognition, commentary generation, and AQA score estimation. Videos from multiple angles are provided wherever available.
25 PAPERS • 2 BENCHMARKS
Contains 68,536 activity instances in 68.8 hours of first- and third-person video, making it one of the largest and most diverse egocentric datasets available. Charades-Ego furthermore shares activity classes, scripts, and methodology with the Charades dataset, which consists of an additional 82.3 hours of third-person video with 66,500 activity instances.
24 PAPERS • 1 BENCHMARK
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
14 PAPERS • NO BENCHMARKS YET
Capturing knowledge from surrounding situations and reasoning over it is crucial and challenging for machine intelligence. The STAR Benchmark is a novel benchmark for Situated Reasoning, which provides 60K challenging situated questions in four types of tasks, 140K situated hypergraphs, symbolic situation programs, and logic-grounded diagnosis for real-world video situations.
14 PAPERS • 3 BENCHMARKS
HVU is organized hierarchically in a semantic taxonomy that treats multi-label and multi-task video understanding as a comprehensive problem, encompassing the recognition of multiple semantic aspects of a dynamic scene. HVU contains approximately 572K videos in total with 9 million annotations across the training, validation, and test sets, spanning 3,142 labels. HVU covers semantic aspects defined on categories of scenes, objects, actions, events, attributes, and concepts, which naturally capture real-world scenarios.
13 PAPERS • NO BENCHMARKS YET
VidSitu is a dataset for the task of semantic role labeling in videos (VidSRL). It is a large-scale video understanding data source with 29K 10-second movie clips richly annotated with a verb and semantic roles every 2 seconds. Entities are co-referenced across events within a movie clip and events are connected to each other via event-event relations. Clips in VidSitu are drawn from a large collection of movies (∼3K) and have been chosen to be both complex (∼4.2 unique verbs within a video) as well as diverse (∼200 verbs have more than 100 annotations each).
12 PAPERS • NO BENCHMARKS YET
Aesthetic Visual Analysis is a dataset for aesthetic image assessment that contains over 250,000 images along with a rich variety of meta-data including a large number of aesthetic scores for each image, semantic labels for over 60 categories as well as labels related to photographic style.
11 PAPERS • 3 BENCHMARKS
A large-scale dataset for retrieval and event localisation in video. A unique feature of the dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content.
11 PAPERS • 1 BENCHMARK
Car Crash Dataset (CCD) is collected for traffic accident analysis. It contains real traffic accident videos captured by dashcams mounted on driving vehicles, which is critical to developing safety-guaranteed self-driving systems. CCD is distinguished from existing datasets by its diversified accident annotations, including environmental attributes (day/night; snowy/rainy/good weather conditions), whether the ego-vehicle is involved, accident participants, and accident reason descriptions.
10 PAPERS • 1 BENCHMARK
DramaQA focuses on two perspectives: 1) hierarchical QAs as an evaluation metric based on the cognitive developmental stages of human intelligence, and 2) character-centered video annotations to model local coherence of the story. The dataset is built upon the TV drama "Another Miss Oh" and contains 17,983 QA pairs from 23,928 various-length video clips, with each QA pair belonging to one of four difficulty levels.
10 PAPERS • 3 BENCHMARKS
Given 10 minimally contrastive (highly similar) images and a complex description of one of them, the task is to retrieve the correct image. Most images are sourced from videos, and both the descriptions and the retrievals come from humans.
8 PAPERS • 1 BENCHMARK
The MLB-YouTube dataset is a new, large-scale dataset consisting of 20 baseball games from the 2017 MLB post-season available on YouTube with over 42 hours of video footage. The dataset consists of two components: segmented videos for activity recognition and continuous videos for activity classification. It is quite challenging as it is created from TV broadcast baseball games where multiple different activities share the camera angle. Further, the motion/appearance difference between the various activities is quite small.
5 PAPERS • NO BENCHMARKS YET
This dataset has the following citation: M. Soliman, M. Kamal, M. Nashed, Y. Mostafa, B. Chawky, D. Khattab, “Violence Recognition from Videos using Deep Learning Techniques”, Proc. 9th International Conference on Intelligent Computing and Information Systems (ICICIS'19), Cairo, pp. 79-84, 2019. Please cite it when using the dataset for research or engineering purposes. When we started our graduation project on violence recognition from videos, we found a shortage of available datasets covering violence between individuals, so we decided to create a new, large dataset with a variety of scenes.
5 PAPERS • 1 BENCHMARK
Comprises 171,191 video segments from 346 high-quality soccer games. The database contains 702,096 bounding boxes, 37,709 essential event labels with time boundaries, and 17,115 highlight annotations for object detection, action recognition, temporal action localization, and highlight detection tasks.
Contains ~9K videos of human agents performing various actions, annotated with 3 types of commonsense descriptions.
Collects dense per-video-shot concept annotations.
4 PAPERS • 1 BENCHMARK
DeepSportradar is a benchmark suite of computer vision tasks, datasets and benchmarks for automated sport understanding. DeepSportradar currently supports four challenging tasks related to basketball: ball 3D localization, camera calibration, player instance segmentation and player re-identification. For each of the four tasks, a detailed description of the dataset, objective, performance metrics, and the proposed baseline method are provided.
3 PAPERS • NO BENCHMARKS YET
Moviescope is a large-scale dataset of 5,000 movies with corresponding video trailers, posters, plots and metadata. Moviescope is based on the IMDB 5000 dataset consisting of 5,043 movie records. It is augmented by crawling video trailers associated with each movie from YouTube and text plots from Wikipedia.
A multilingual, multimodal and multi-aspect, expertly-annotated dataset of diverse short videos extracted from the short-video social media platform Moj. 3MASSIV comprises 50K short videos (~20 seconds average duration) and 100K unlabeled videos in 11 different languages, and captures popular short video trends like pranks, fails, romance, and comedy, expressed via unique audio-visual formats such as self-shot videos, reaction videos, lip-synching, and self-sung songs.
2 PAPERS • NO BENCHMARKS YET
A novel dataset that represents complex conversational interactions between two individuals via 3D pose. 8 pairwise interactions describing 7 separate conversation-based scenarios were collected using two Kinect depth sensors.
Contains annotations of human activity with different sub-actions, e.g., the activity Ping-Pong with four sub-actions: pickup-ball, hit, bounce-ball, and serve.
Largest, first-of-its-kind, in-the-wild, fine-grained workout/exercise posture analysis dataset, covering three different exercises: BackSquat, Barbell Row, and Overhead Press. Seven different types of exercise errors are covered. Unlabeled data is also provided to facilitate self-supervised learning.
Stanford-ECM is an egocentric multimodal dataset which comprises about 27 hours of egocentric video augmented with heart rate and acceleration data. The individual videos range from 3 minutes to about 51 minutes in length. A mobile phone was used to collect egocentric video at 720x1280 resolution and 30 fps, as well as triaxial acceleration at 30Hz. The mobile phone was equipped with a wide-angle lens, so that the horizontal field of view was enlarged from 45 degrees to about 64 degrees. A wrist-worn heart rate sensor was used to capture the heart rate every 5 seconds. The phone and heart rate monitor were time-synchronized through Bluetooth, and all data was stored in the phone’s storage. Piecewise cubic polynomial interpolation was used to fill in any gaps in heart rate data. Finally, data was aligned to the millisecond level at 30 Hz.
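The resampling step described above can be illustrated with a short sketch; this is not the authors' code, and the function and variable names are hypothetical.

```python
# Illustrative sketch: heart rate sampled every 5 s is interpolated with a
# piecewise cubic and resampled onto the 30 Hz video timeline.
import numpy as np
from scipy.interpolate import CubicSpline

def resample_heart_rate(hr_times_s, hr_values_bpm, video_duration_s, fps=30):
    """Return (frame_times, heart rate) aligned to each 30 Hz video frame."""
    spline = CubicSpline(hr_times_s, hr_values_bpm)   # extrapolates near the ends
    frame_times = np.arange(0.0, video_duration_s, 1.0 / fps)
    return frame_times, spline(frame_times)

# Example: 10 minutes of video, heart rate samples every 5 seconds.
t_hr = np.arange(0, 600, 5.0)
hr = 70 + 10 * np.sin(t_hr / 60.0)                    # synthetic heart-rate trace
frame_t, hr_per_frame = resample_heart_rate(t_hr, hr, video_duration_s=600)
```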
VTC is a large-scale multimodal dataset containing video-caption pairs (~300k) alongside comments that can be used for multimodal representation learning.
The feature files are named with the YouTube IDs. https://drive.google.com/drive/folders/10-6hkQxMKMGwLXANxfPRE7xw5PKiMjLn?usp=sharing
1 PAPER • NO BENCHMARKS YET
We provide a database containing shot scale annotations (i.e., the apparent distance of the camera from the subject of a filmed scene) for more than 792,000 image frames. Frames belong to 124 full movies from the entire filmographies of 6 notable directors: Martin Scorsese, Jean-Luc Godard, Béla Tarr, Federico Fellini, Michelangelo Antonioni, and Ingmar Bergman. Each frame, extracted from the videos at 1 frame per second, is annotated with one of the following scale categories: Extreme Close Up (ECU), Close Up (CU), Medium Close Up (MCU), Medium Shot (MS), Medium Long Shot (MLS), Long Shot (LS), Extreme Long Shot (ELS), Foreground Shot (FS), and Insert Shot (IS). Two independent coders annotated all frames from the 124 movies, while a third checked their coding and resolved cases of disagreement. The CineScale database enables AI-driven interpretation of shot scale data and opens up a large set of research activities related to the automatic visual analysis of cinematic material.
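For convenience, the nine scale categories and their abbreviations, transcribed directly from the list above, can be kept as a simple lookup table:

```python
# Shot scale categories as listed in the dataset description.
SHOT_SCALES = {
    "ECU": "Extreme Close Up",
    "CU": "Close Up",
    "MCU": "Medium Close Up",
    "MS": "Medium Shot",
    "MLS": "Medium Long Shot",
    "LS": "Long Shot",
    "ELS": "Extreme Long Shot",
    "FS": "Foreground Shot",
    "IS": "Insert Shot",
}
```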
This spatio-temporal action dataset for video understanding consists of 4 parts: original videos, cropped videos, video frames, and annotation files. The dataset uses a proposed new multi-person annotation method for spatio-temporal actions. First, we use ffmpeg to crop the videos and extract frames; then we use YOLOv5 to detect humans in each video frame, and DeepSORT to assign IDs to the detected people across frames. By processing the YOLOv5 and DeepSORT detection results, we obtain the annotation files of the spatio-temporal action dataset, completing the workflow for building a custom spatio-temporal action dataset.
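A rough sketch of this detection-and-tracking pipeline is given below. It is not the authors' code; it assumes the torch.hub interface of ultralytics/yolov5 together with the deep-sort-realtime package, so the package names and call signatures should be treated as assumptions and checked against the libraries you install.

```python
# Rough sketch of the ffmpeg -> YOLOv5 -> Deep SORT pipeline described above.
import subprocess
import cv2
import torch
from deep_sort_realtime.deepsort_tracker import DeepSort

def extract_frames(video_path, out_pattern="frames/%06d.jpg", fps=30):
    # Step 1: extract frames from the (cropped) video with ffmpeg.
    subprocess.run(["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", out_pattern],
                   check=True)

def track_people(video_path):
    # Step 2: detect people with YOLOv5; Step 3: assign IDs with Deep SORT.
    detector = torch.hub.load("ultralytics/yolov5", "yolov5s")
    tracker = DeepSort(max_age=30)
    annotations = []
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        det = detector(rgb).xyxy[0]          # columns: x1, y1, x2, y2, conf, cls
        persons = [([x1, y1, x2 - x1, y2 - y1], conf, "person")
                   for x1, y1, x2, y2, conf, cls in det.tolist() if int(cls) == 0]
        for trk in tracker.update_tracks(persons, frame=frame):
            if trk.is_confirmed():
                annotations.append((frame_idx, trk.track_id, trk.to_ltrb().tolist()))
        frame_idx += 1
    cap.release()
    return annotations  # (frame index, person ID, box) rows for the annotation file
```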
Kinetics-GEB+ (Generic Event Boundary Captioning, Grounding and Retrieval) is a dataset that consists of over 170k boundaries associated with captions describing status changes in the generic events in 12K videos.
1 PAPER • 3 BENCHMARKS
Placepedia contains 240K places with 35M images from all over the world. Each place is associated with its district, city/town/village, state/province, country, and continent, along with a large number of diverse photos. Both administrative areas and places have rich side information, e.g. description, population, category, and function. In addition, two cleaned subsets (Places-Coarse and Places-Fine) are provided for experiments.
Trailers12k is a movie trailer dataset comprising 12,000 titles associated with ten genres. It is distinguished from other datasets by a collection procedure aimed at providing a high-quality, publicly available dataset.
WildQA is a video understanding dataset of videos recorded in outdoor settings. The dataset can be used to evaluate models for video question answering.
1 PAPER • 1 BENCHMARK
A first-of-its-kind paired win-fail action understanding dataset with samples from the following domains: “General Stunts,” “Internet Wins-Fails,” “Trick Shots,” and “Party Games.” The task is to identify successful and failed attempts at various activities. Unlike existing action recognition datasets, intra-class variation is high, making the task challenging yet feasible.
1 PAPER • 2 BENCHMARKS
The i3-video dataset contains "is-it-instructional" annotations for 6.4K videos from YouTube-8M. Videos are considered instructional if they focus on real-world human actions accompanied by procedural language that explains what is happening on screen in reasonable detail.
We construct a fine-grained video-text dataset with 12K annotated high-resolution videos (~400K clips). The annotation of this dataset is inspired by video scripts. To make a video, one first writes a script to organize how to shoot the scenes. For each scene, one must decide the content, the shot type (medium shot, close-up, etc.), and how the camera moves (panning, tilting, etc.). We therefore extend video captioning to video scripting by annotating the videos in the format of video scripts. Unlike previous video-text datasets, we densely annotate entire videos without discarding any scenes, and each scene has a caption of ~145 words. Besides the vision modality, we transcribe the voice-over into text and provide it along with the video title to give more background information for annotating the videos.
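To make the script-style annotation concrete, here is a hypothetical per-scene record; the field names are illustrative and do not reflect the dataset's official schema.

```python
# Hypothetical sketch of one scene-level record in the video-script-style annotation.
from dataclasses import dataclass

@dataclass
class SceneAnnotation:
    video_id: str
    scene_index: int
    shot_type: str        # e.g. "medium shot", "close-up"
    camera_movement: str  # e.g. "panning", "tilting"
    caption: str          # dense description, ~145 words on average
    voice_over: str       # transcribed narration for the scene
    video_title: str      # background context used during annotation

example = SceneAnnotation(
    video_id="demo_video", scene_index=0, shot_type="close-up",
    camera_movement="panning",
    caption="A cook slices vegetables on a wooden board...",
    voice_over="Today we are preparing a quick stir-fry.",
    video_title="5-minute stir-fry")
```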
0 PAPERS • NO BENCHMARKS YET