Temporal Action Localization
422 papers with code • 14 benchmarks • 42 datasets
Temporal Action Localization aims to detect action instances in an untrimmed video stream and output their start and end timestamps, along with the action class. It is closely related to Temporal Action Proposal Generation.
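A minimal sketch of what this task's output looks like and of the temporal IoU (tIoU) commonly used to score a prediction against ground truth; the `ActionSegment` structure and the example values are illustrative, not tied to any particular library.

```python
from dataclasses import dataclass

@dataclass
class ActionSegment:
    label: str      # predicted action class
    start: float    # start time in seconds
    end: float      # end time in seconds
    score: float    # detection confidence

def temporal_iou(pred: ActionSegment, gt: ActionSegment) -> float:
    """Intersection-over-union of two temporal intervals."""
    inter = max(0.0, min(pred.end, gt.end) - max(pred.start, gt.start))
    union = (pred.end - pred.start) + (gt.end - gt.start) - inter
    return inter / union if union > 0 else 0.0

pred = ActionSegment("long_jump", start=12.4, end=18.9, score=0.87)
gt   = ActionSegment("long_jump", start=13.0, end=19.5, score=1.0)
print(temporal_iou(pred, gt))  # ~0.83; a match at the common 0.5 tIoU threshold
```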
Most implemented papers
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition
Dynamics of human body skeletons convey significant information for human action recognition.
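The core operation can be sketched as a graph convolution over the skeleton's joint adjacency followed by a temporal convolution over frames. The block below is a hedged simplification (the paper additionally uses learned partition strategies); `STGCNBlock` and all tensor sizes are chosen for illustration, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    def __init__(self, in_ch, out_ch, A):
        super().__init__()
        # A: (V, V) joint adjacency with self-loops, row-normalized here
        self.register_buffer("A", A / A.sum(dim=1, keepdim=True))
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(9, 1),
                                  padding=(4, 0))
        self.relu = nn.ReLU()

    def forward(self, x):                 # x: (N, C, T, V)
        x = self.spatial(x)               # mix channels per joint
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # aggregate joint neighbors
        return self.relu(self.temporal(x))            # convolve over time

V = 18                                         # e.g. an OpenPose joint count
A = torch.eye(V) + torch.rand(V, V).round()    # toy adjacency for the demo
out = STGCNBlock(3, 64, A)(torch.randn(2, 3, 32, V))
print(out.shape)                               # torch.Size([2, 64, 32, 18])
```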
Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks
Over the last decade, Convolutional Neural Network (CNN) models have been highly successful in solving complex vision problems.
A Closer Look at Spatiotemporal Convolutions for Action Recognition
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition.
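One factorization examined in this line of work replaces a full 3D convolution with a 2D spatial convolution followed by a 1D temporal convolution, the "(2+1)D" block. The sketch below uses illustrative channel sizes rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv2plus1d(in_ch, out_ch, mid_ch):
    return nn.Sequential(
        # spatial 1x3x3 convolution applied to each frame
        nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
        nn.BatchNorm3d(mid_ch),
        nn.ReLU(),
        # temporal 3x1x1 convolution across frames
        nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
    )

x = torch.randn(2, 3, 16, 112, 112)      # (N, C, T, H, W) video clip
print(conv2plus1d(3, 64, 45)(x).shape)   # torch.Size([2, 64, 16, 112, 112])
```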
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
A further contribution is a study of a series of good practices for learning ConvNets on video data with the help of the temporal segment network framework.
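The segment-based sampling and consensus can be sketched as: split the video into K segments, score one snippet per segment with a shared 2D ConvNet, and average the snippet scores. The toy backbone and random frame sampling below are stand-ins, not the paper's code.

```python
import torch
import torch.nn as nn

K, num_classes = 3, 101
backbone = nn.Sequential(                 # toy stand-in for a 2D ConvNet
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes),
)

video = torch.randn(90, 3, 224, 224)      # 90 frames, (T, C, H, W)
segments = video.chunk(K)                 # split into K equal segments
snippets = torch.stack([seg[torch.randint(len(seg), (1,))].squeeze(0)
                        for seg in segments])   # one random frame per segment
scores = backbone(snippets)               # (K, num_classes) snippet scores
video_score = scores.mean(dim=0)          # average consensus over segments
print(video_score.shape)                  # torch.Size([101])
```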
BSN: Boundary Sensitive Network for Temporal Action Proposal Generation
Temporal action proposal generation is an important yet challenging problem, since temporal proposals with rich action content are indispensable for analysing real-world videos with long duration and a high proportion of irrelevant content.
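The boundary-sensitive proposal idea can be approximated as: score every temporal location for being an action start or end, keep peak or high-scoring locations, and pair each candidate start with each later candidate end. The thresholding and proposal scoring below are illustrative simplifications, not the paper's learned modules.

```python
import numpy as np

def candidate_locations(prob, thresh=0.5):
    """Locations that exceed the threshold or are local maxima."""
    peaks = [t for t in range(1, len(prob) - 1)
             if prob[t] > prob[t - 1] and prob[t] > prob[t + 1]]
    high = [t for t in range(len(prob)) if prob[t] > thresh]
    return sorted(set(peaks) | set(high))

rng = np.random.default_rng(0)
start_prob = rng.random(100)              # per-snippet start probability
end_prob = rng.random(100)                # per-snippet end probability

proposals = [(s, e, start_prob[s] * end_prob[e])
             for s in candidate_locations(start_prob)
             for e in candidate_locations(end_prob)
             if s < e]                    # a start must precede its end
proposals.sort(key=lambda p: p[2], reverse=True)
print(proposals[0])                       # best (start, end, score) triple
```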
BMN: Boundary-Matching Network for Temporal Action Proposal Generation
To address these difficulties, we introduce the Boundary-Matching (BM) mechanism to evaluate confidence scores of densely distributed proposals: it denotes a proposal as a matching pair of starting and ending boundaries and combines all densely distributed BM pairs into the BM confidence map.
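Because each proposal is determined by a (start, duration) pair, confidences for all densely enumerated proposals fit into a single 2D map. In this sketch the learned BM layer is replaced by a simple mean-actionness stand-in, just to show the map's indexing.

```python
import numpy as np

T, D = 100, 32                             # snippets, max proposal duration
actionness = np.random.default_rng(1).random(T)

bm_map = np.zeros((D, T))                  # bm_map[d-1, s]: interval [s, s+d)
for d in range(1, D + 1):                  # proposal duration in snippets
    for s in range(T - d + 1):             # proposal start snippet
        bm_map[d - 1, s] = actionness[s:s + d].mean()

i, s = np.unravel_index(bm_map.argmax(), bm_map.shape)
print(f"best proposal: start={s}, duration={i + 1}, "
      f"confidence={bm_map[i, s]:.3f}")
```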
Temporal Segment Networks for Action Recognition in Videos
Furthermore, based on temporal segment networks, we won the video classification track of the ActivityNet Challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
Unsupervised Learning of Video Representations using LSTMs
We further evaluate the representations by finetuning them for a supervised learning problem - human action recognition on the UCF-101 and HMDB-51 datasets.
Graph-Based Global Reasoning Networks
In this work, we propose a new approach for reasoning globally, in which a set of features is globally aggregated over the coordinate space and then projected to an interaction space where relational reasoning can be computed efficiently.
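A hedged sketch of that projection-reason-reproject cycle: pool coordinate-space features onto a small set of interaction-space nodes, apply a graph convolution over the nodes, and project back with a residual connection. Layer shapes are illustrative, and the node-graph step is simplified to a single 1x1 convolution rather than the paper's full configuration.

```python
import torch
import torch.nn as nn

class GlobalReasoningUnit(nn.Module):
    def __init__(self, channels, nodes=16, node_ch=64):
        super().__init__()
        self.to_nodes = nn.Conv2d(channels, nodes, 1)   # projection weights
        self.reduce = nn.Conv2d(channels, node_ch, 1)
        self.gcn = nn.Conv1d(nodes, nodes, 1)           # reasoning over nodes
        self.expand = nn.Conv2d(node_ch, channels, 1)

    def forward(self, x):                  # x: (N, C, H, W)
        n, c, h, w = x.shape
        B = self.to_nodes(x).flatten(2)    # (N, nodes, H*W) projection map
        V = B @ self.reduce(x).flatten(2).transpose(1, 2)  # (N, nodes, node_ch)
        V = torch.relu(self.gcn(V))        # relational reasoning on node graph
        y = (B.transpose(1, 2) @ V)        # back to (N, H*W, node_ch)
        y = y.transpose(1, 2).reshape(n, -1, h, w)
        return x + self.expand(y)          # residual reverse projection

x = torch.randn(2, 256, 14, 14)
print(GlobalReasoningUnit(256)(x).shape)   # torch.Size([2, 256, 14, 14])
```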
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.