Action Recognition
881 papers with code • 49 benchmarks • 105 datasets
Action Recognition is a computer vision task that involves recognizing human actions in videos or images. The goal is to classify and categorize the actions being performed in the video or image into a predefined set of action classes.
In the video domain, it is an open question whether training an action classification network on a sufficiently large dataset, will give a similar boost in performance when applied to a different temporal task or dataset. The challenges of building video datasets has meant that most popular benchmarks for action recognition are small, having on the order of 10k videos.
Please note some benchmarks may be located in the Action Classification or Video Classification tasks, e.g. Kinetics-400.
Libraries
Use these libraries to find Action Recognition models and implementationsDatasets
Subtasks
- Action Recognition In Videos
- 3D Action Recognition
- Self-Supervised Action Recognition
- Few Shot Action Recognition
- Few Shot Action Recognition
- Fine-grained Action Recognition
- Action Triplet Recognition
- Open Set Action Recognition
- Micro-Action Recognition
- Weakly-Supervised Action Recognition
- Atomic action recognition
- Animal Action Recognition
- Transportation Mode Detection
- Open Vocabulary Action Recognition
- Action Recognition In Still Images
Most implemented papers
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available.
Learning Transferable Visual Models From Natural Language Supervision
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks.
Non-local Neural Networks
Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time.
Learning Spatiotemporal Features with 3D Convolutional Networks
We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset.
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels.
CIDEr: Consensus-based Image Description Evaluation
We propose a novel paradigm for evaluating image descriptions that uses human consensus.
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition
Dynamics of human body skeletons convey significant information for human action recognition.
Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks
Over the last decade, Convolutional Neural Network (CNN) models have been highly successful in solving complex vision problems.
A Closer Look at Spatiotemporal Convolutions for Action Recognition
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition.