Temporal Action Localization
422 papers with code • 14 benchmarks • 42 datasets
Temporal Action Localization aims to detect action instances in an untrimmed video stream and output their start and end timestamps, along with the action class. It is closely related to Temporal Action Proposal Generation.
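A minimal sketch of what this task's output looks like and of the temporal IoU (tIoU) commonly used to score a prediction against ground truth; the `ActionSegment` structure and the example values are illustrative, not tied to any particular library.

```python
from dataclasses import dataclass

@dataclass
class ActionSegment:
    label: str      # predicted action class
    start: float    # start time in seconds
    end: float      # end time in seconds
    score: float    # detection confidence

def temporal_iou(pred: ActionSegment, gt: ActionSegment) -> float:
    """Intersection-over-union of two temporal intervals."""
    inter = max(0.0, min(pred.end, gt.end) - max(pred.start, gt.start))
    union = (pred.end - pred.start) + (gt.end - gt.start) - inter
    return inter / union if union > 0 else 0.0

pred = ActionSegment("long_jump", start=12.4, end=18.9, score=0.87)
gt   = ActionSegment("long_jump", start=13.0, end=19.5, score=1.0)
print(temporal_iou(pred, gt))  # ~0.83; a match at the common 0.5 tIoU threshold
```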
Most implemented papers
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition
Dynamics of human body skeletons convey significant information for human action recognition.
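The core operation can be sketched as a graph convolution over the skeleton's joint adjacency followed by a temporal convolution over frames. The block below is a hedged simplification (the paper additionally uses learned partition strategies); `STGCNBlock` and all tensor sizes are chosen for illustration, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    def __init__(self, in_ch, out_ch, A):
        super().__init__()
        # A: (V, V) joint adjacency with self-loops, row-normalized here
        self.register_buffer("A", A / A.sum(dim=1, keepdim=True))
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(9, 1),
                                  padding=(4, 0))
        self.relu = nn.ReLU()

    def forward(self, x):                 # x: (N, C, T, V)
        x = self.spatial(x)               # mix channels per joint
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # aggregate joint neighbors
        return self.relu(self.temporal(x))            # convolve over time

V = 18                                         # e.g. an OpenPose joint count
A = torch.eye(V) + torch.rand(V, V).round()    # toy adjacency for the demo
out = STGCNBlock(3, 64, A)(torch.randn(2, 3, 32, V))
print(out.shape)                               # torch.Size([2, 64, 32, 18])
```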
Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks
Over the last decade, Convolutional Neural Network (CNN) models have been highly successful in solving complex vision problems.
A Closer Look at Spatiotemporal Convolutions for Action Recognition
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition.
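One factorization examined in this line of work replaces a full 3D convolution with a 2D spatial convolution followed by a 1D temporal convolution, the "(2+1)D" block. The sketch below uses illustrative channel sizes rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv2plus1d(in_ch, out_ch, mid_ch):
    return nn.Sequential(
        # spatial 1x3x3 convolution applied to each frame
        nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
        nn.BatchNorm3d(mid_ch),
        nn.ReLU(),
        # temporal 3x1x1 convolution across frames
        nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
    )

x = torch.randn(2, 3, 16, 112, 112)      # (N, C, T, H, W) video clip
print(conv2plus1d(3, 64, 45)(x).shape)   # torch.Size([2, 64, 16, 112, 112])
```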
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
A further contribution is a study of a series of good practices for learning ConvNets on video data with the help of the temporal segment network framework.
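The segment-based sampling and consensus can be sketched as: split the video into K segments, score one snippet per segment with a shared 2D ConvNet, and average the snippet scores. The toy backbone and random frame sampling below are stand-ins, not the paper's code.

```python
import torch
import torch.nn as nn

K, num_classes = 3, 101
backbone = nn.Sequential(                 # toy stand-in for a 2D ConvNet
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes),
)

video = torch.randn(90, 3, 224, 224)      # 90 frames, (T, C, H, W)
segments = video.chunk(K)                 # split into K equal segments
snippets = torch.stack([seg[torch.randint(len(seg), (1,))].squeeze(0)
                        for seg in segments])   # one random frame per segment
scores = backbone(snippets)               # (K, num_classes) snippet scores
video_score = scores.mean(dim=0)          # average consensus over segments
print(video_score.shape)                  # torch.Size([101])
```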
BSN: Boundary Sensitive Network for Temporal Action Proposal Generation
Temporal action proposal generation is an important yet challenging problem, since temporal proposals with rich action content are indispensable for analysing real-world videos with long duration and a high proportion of irrelevant content.
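The boundary-sensitive proposal idea can be approximated as: score every temporal location for being an action start or end, keep peak or high-scoring locations, and pair each candidate start with each later candidate end. The thresholding and proposal scoring below are illustrative simplifications, not the paper's learned modules.

```python
import numpy as np

def candidate_locations(prob, thresh=0.5):
    """Locations that exceed the threshold or are local maxima."""
    peaks = [t for t in range(1, len(prob) - 1)
             if prob[t] > prob[t - 1] and prob[t] > prob[t + 1]]
    high = [t for t in range(len(prob)) if prob[t] > thresh]
    return sorted(set(peaks) | set(high))

rng = np.random.default_rng(0)
start_prob = rng.random(100)              # per-snippet start probability
end_prob = rng.random(100)                # per-snippet end probability

proposals = [(s, e, start_prob[s] * end_prob[e])
             for s in candidate_locations(start_prob)
             for e in candidate_locations(end_prob)
             if s < e]                    # a start must precede its end
proposals.sort(key=lambda p: p[2], reverse=True)
print(proposals[0])                       # best (start, end, score) triple
```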
BMN: Boundary-Matching Network for Temporal Action Proposal Generation
To address these difficulties, we introduce the Boundary-Matching (BM) mechanism to evaluate confidence scores of densely distributed proposals: it denotes a proposal as a matching pair of starting and ending boundaries and combines all densely distributed BM pairs into the BM confidence map.
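Because each proposal is determined by a (start, duration) pair, confidences for all densely enumerated proposals fit into a single 2D map. In this sketch the learned BM layer is replaced by a simple mean-actionness stand-in, just to show the map's indexing.

```python
import numpy as np

T, D = 100, 32                             # snippets, max proposal duration
actionness = np.random.default_rng(1).random(T)

bm_map = np.zeros((D, T))                  # bm_map[d-1, s]: interval [s, s+d)
for d in range(1, D + 1):                  # proposal duration in snippets
    for s in range(T - d + 1):             # proposal start snippet
        bm_map[d - 1, s] = actionness[s:s + d].mean()

i, s = np.unravel_index(bm_map.argmax(), bm_map.shape)
print(f"best proposal: start={s}, duration={i + 1}, "
      f"confidence={bm_map[i, s]:.3f}")
```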
Temporal Segment Networks for Action Recognition in Videos
Furthermore, based on temporal segment networks, we won the video classification track of the ActivityNet Challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
Unsupervised Learning of Video Representations using LSTMs
We further evaluate the representations by finetuning them for a supervised learning problem - human action recognition on the UCF-101 and HMDB-51 datasets.
Graph-Based Global Reasoning Networks
In this work, we propose a new approach for reasoning globally, in which a set of features is globally aggregated over the coordinate space and then projected to an interaction space where relational reasoning can be computed efficiently.
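A hedged sketch of that projection-reason-reproject cycle: pool coordinate-space features onto a small set of interaction-space nodes, apply a graph convolution over the nodes, and project back with a residual connection. Layer shapes are illustrative, and the node-graph step is simplified to a single 1x1 convolution rather than the paper's full configuration.

```python
import torch
import torch.nn as nn

class GlobalReasoningUnit(nn.Module):
    def __init__(self, channels, nodes=16, node_ch=64):
        super().__init__()
        self.to_nodes = nn.Conv2d(channels, nodes, 1)   # projection weights
        self.reduce = nn.Conv2d(channels, node_ch, 1)
        self.gcn = nn.Conv1d(nodes, nodes, 1)           # reasoning over nodes
        self.expand = nn.Conv2d(node_ch, channels, 1)

    def forward(self, x):                  # x: (N, C, H, W)
        n, c, h, w = x.shape
        B = self.to_nodes(x).flatten(2)    # (N, nodes, H*W) projection map
        V = B @ self.reduce(x).flatten(2).transpose(1, 2)  # (N, nodes, node_ch)
        V = torch.relu(self.gcn(V))        # relational reasoning on node graph
        y = (B.transpose(1, 2) @ V)        # back to (N, H*W, node_ch)
        y = y.transpose(1, 2).reshape(n, -1, h, w)
        return x + self.expand(y)          # residual reverse projection

x = torch.randn(2, 256, 14, 14)
print(GlobalReasoningUnit(256)(x).shape)   # torch.Size([2, 256, 14, 14])
```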
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.