Action Detection
233 papers with code • 11 benchmarks • 33 datasets
Action Detection aims to find both where and when an action occurs within a video clip and to classify what action is taking place. Results are typically given in the form of action tubelets, which are action bounding boxes linked across time in the video. This is related to temporal localization, which seeks to identify the start and end frames of an action, and to action recognition, which seeks only to classify which action is taking place and typically assumes a trimmed video.
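To make the output format concrete, here is a minimal sketch of an action tubelet as a plain data structure: a class label plus per-frame bounding boxes linked across time. The class names (`FrameBox`, `ActionTubelet`) and field layout are illustrative assumptions, not the API of any particular detection library.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A per-frame detection: frame index plus an (x1, y1, x2, y2) bounding box.
@dataclass
class FrameBox:
    frame_idx: int
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

# An action tubelet: one action label plus boxes linked across consecutive frames.
@dataclass
class ActionTubelet:
    label: str
    score: float
    boxes: List[FrameBox] = field(default_factory=list)

    @property
    def start_frame(self) -> int:
        return self.boxes[0].frame_idx

    @property
    def end_frame(self) -> int:
        return self.boxes[-1].frame_idx

# Example: a short "jumping" tubelet spanning frames 10-12.
tubelet = ActionTubelet(
    label="jumping",
    score=0.87,
    boxes=[
        FrameBox(10, (34.0, 50.0, 120.0, 210.0)),
        FrameBox(11, (36.0, 51.0, 122.0, 212.0)),
        FrameBox(12, (38.0, 52.0, 124.0, 214.0)),
    ],
)
print(tubelet.label, tubelet.start_frame, tubelet.end_frame)
```

The temporal extent (start and end frame) is what temporal localization predicts on its own; the per-frame boxes are what distinguish spatio-temporal action detection from it.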
Most implemented papers
Continuous control with deep reinforcement learning
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain.
BSN: Boundary Sensitive Network for Temporal Action Proposal Generation
Temporal action proposal generation is an important yet challenging problem, since temporal proposals with rich action content are indispensable for analysing real-world videos with long duration and a high proportion of irrelevant content.
SlowFast Networks for Video Recognition
We present SlowFast networks for video recognition.
BMN: Boundary-Matching Network for Temporal Action Proposal Generation
To address these difficulties, we introduce the Boundary-Matching (BM) mechanism to evaluate confidence scores of densely distributed proposals, which denotes each proposal as a matching pair of starting and ending boundaries and combines all densely distributed BM pairs into the BM confidence map.
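The following sketch only illustrates the layout of such a confidence map, not the BMN network itself: proposals are enumerated as (start, duration) pairs, their scores stored in a 2D map, and the best proposal decoded back into a temporal interval. The map dimensions and the random placeholder scores are assumptions for illustration.

```python
import numpy as np

T = 100  # number of temporal locations (snippets) in the video
D = 32   # maximum proposal duration considered, in snippets

rng = np.random.default_rng(0)

# BM confidence map: rows index proposal duration, columns index start time.
bm_map = np.zeros((D, T), dtype=np.float32)
for start in range(T):
    for dur in range(1, D + 1):
        if start + dur <= T:                       # proposal must end inside the video
            bm_map[dur - 1, start] = rng.random()  # stand-in for a predicted confidence

# Decode the highest-confidence proposal back into (start, end) snippet indices.
dur_idx, start_idx = np.unravel_index(np.argmax(bm_map), bm_map.shape)
print("best proposal:", start_idx, "to", start_idx + dur_idx + 1)
```

In BMN the scores in this map come from the network in a single forward pass, rather than being evaluated proposal by proposal.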
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.
Rescaling Egocentric Vision
This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS.
Temporal Action Detection with Structured Segment Networks
Detecting actions in untrimmed videos is an important yet challenging task.
CholecTriplet2021: A benchmark challenge for surgical action triplet recognition
In this paper, we present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge.
HAKE: Human Activity Knowledge Engine
To address these issues and promote activity understanding, we build a large-scale Human Activity Knowledge Engine (HAKE) based on human body part states.
You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization
YOWO is a single-stage architecture with two branches to extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation.
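Below is a highly simplified PyTorch sketch of that two-branch idea: a 3D-convolutional branch over the whole clip for temporal information, a 2D-convolutional branch over the key frame for spatial information, and a single conv head predicting boxes and class scores per grid cell. This is an assumption-laden toy model, not the released YOWO implementation; in particular, YOWO's channel fusion and attention module is replaced here by plain concatenation.

```python
import torch
import torch.nn as nn

class TinyYOWO(nn.Module):
    """Toy single-stage, two-branch spatiotemporal detector (illustrative only)."""

    def __init__(self, num_classes: int, num_anchors: int = 3):
        super().__init__()
        # Temporal branch: 3D convs over the clip (stand-in for a 3D-CNN backbone).
        self.branch_3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
        )
        # Spatial branch: 2D convs over the key frame (stand-in for a 2D-CNN backbone).
        self.branch_2d = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Head: per anchor, 4 box coords + 1 objectness + class scores.
        self.head = nn.Conv2d(128, num_anchors * (5 + num_classes), kernel_size=1)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width); key frame = last frame.
        feat_3d = self.branch_3d(clip).mean(dim=2)    # collapse time -> (B, 64, H', W')
        feat_2d = self.branch_2d(clip[:, :, -1])      # key frame    -> (B, 64, H', W')
        fused = torch.cat([feat_3d, feat_2d], dim=1)  # channel-wise fusion
        return self.head(fused)                       # (B, A*(5+C), H', W')

# One forward pass on a dummy 16-frame clip.
model = TinyYOWO(num_classes=24)
out = model(torch.randn(1, 3, 16, 224, 224))
print(out.shape)
```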