Activity Recognition In Videos
10 papers with code • 1 benchmark • 2 datasets
Most implemented papers
Very Deep Convolutional Networks for Large-Scale Image Recognition
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting.
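The key design choice is stacking many small 3x3 convolutions to increase depth while keeping filters small. Below is a minimal PyTorch sketch of one such VGG-style stack; the layer counts and channel widths are illustrative, not the exact VGG-16 configuration.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """Stack of 3x3 convolutions followed by 2x2 max pooling (VGG-style)."""
    layers = []
    for i in range(num_convs):
        layers.append(nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Illustrative 8-conv-layer variant; deeper configurations simply add blocks/convs.
backbone = nn.Sequential(
    vgg_block(3, 64, 2),
    vgg_block(64, 128, 2),
    vgg_block(128, 256, 2),
    vgg_block(256, 512, 2),
)

x = torch.randn(1, 3, 224, 224)
print(backbone(x).shape)  # torch.Size([1, 512, 14, 14])
```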
Representation Flow for Action Recognition
Our representation flow layer is a fully-differentiable layer designed to capture the 'flow' of any representation channel within a convolutional neural network for action recognition.
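The paper unrolls iterative optical-flow-style updates as a differentiable layer applied to feature channels. The sketch below is a strongly simplified stand-in rather than the authors' TV-L1-based layer: it runs a few gradient steps on the brightness-constancy residual between two feature maps, using fixed central-difference kernels for the spatial derivatives.

```python
import torch
import torch.nn.functional as F

def feature_flow(f1, f2, n_iter=10, lr=0.5):
    """Crude differentiable flow between two single-channel feature maps.

    f1, f2: tensors of shape (B, 1, H, W).
    Runs a few gradient steps on the brightness-constancy residual
    (Ix*u + Iy*v + It)^2 and returns a flow field of shape (B, 2, H, W).
    """
    # Fixed central-difference kernels for spatial derivatives.
    kx = torch.tensor([[[[-0.5, 0.0, 0.5]]]], dtype=f1.dtype, device=f1.device)
    ky = kx.transpose(2, 3)
    Ix = F.conv2d(f1, kx, padding=(0, 1))
    Iy = F.conv2d(f1, ky, padding=(1, 0))
    It = f2 - f1

    u = torch.zeros_like(f1)
    v = torch.zeros_like(f1)
    for _ in range(n_iter):
        r = Ix * u + Iy * v + It          # brightness-constancy residual
        u = u - lr * r * Ix               # gradient step w.r.t. u
        v = v - lr * r * Iy               # gradient step w.r.t. v
    return torch.cat([u, v], dim=1)

f1 = torch.rand(2, 1, 56, 56)
f2 = torch.rand(2, 1, 56, 56)
print(feature_flow(f1, f2).shape)         # torch.Size([2, 2, 56, 56])
```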
Large-scale weakly-supervised pre-training for video action recognition
Frame-based models perform quite well on action recognition; is pre-training for good image features sufficient, or is pre-training for spatio-temporal features valuable for optimal transfer learning?
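A frame-based model in this sense applies a 2D image backbone to each frame and aggregates over time, so pre-training amounts to initializing that backbone. A minimal sketch, using torchvision's ResNet-18 as an arbitrary backbone choice (the paper's models and pre-training data are far larger):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FrameBasedClassifier(nn.Module):
    """2D backbone applied per frame, followed by temporal average pooling."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = resnet18()                 # image backbone (weights omitted here)
        backbone.fc = nn.Identity()           # keep the 512-d pooled features
        self.backbone = backbone
        self.head = nn.Linear(512, num_classes)

    def forward(self, video):                 # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1))   # (B*T, 512)
        feats = feats.view(b, t, -1).mean(dim=1)     # temporal average pooling
        return self.head(feats)

clips = torch.randn(2, 8, 3, 224, 224)
print(FrameBasedClassifier(num_classes=400)(clips).shape)  # torch.Size([2, 400])
```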
ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos
Our framework leverages both labeled and unlabeled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning to learn effectively from both types of samples.
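A generic sketch of how confidence-thresholded pseudo-labels and an InfoNCE-style contrastive term can be combined on unlabeled clips. The threshold, temperature, and two-view setup are illustrative assumptions; the paper's cross-architecture (transformer plus 3D CNN) design is not reproduced here.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(logits_weak, logits_strong, emb_a, emb_b,
                         threshold=0.95, temperature=0.1):
    """Pseudo-label loss + contrastive loss on a batch of unlabeled clips.

    logits_weak/strong: class logits for weakly/strongly augmented views (B, C).
    emb_a/emb_b: projected embeddings of two views of the same clips (B, D).
    """
    # --- Pseudo-labeling: keep only confident predictions on the weak view. ---
    with torch.no_grad():
        probs = logits_weak.softmax(dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()
    ce = F.cross_entropy(logits_strong, pseudo, reduction="none")
    pseudo_loss = (ce * mask).sum() / mask.sum().clamp(min=1.0)

    # --- Contrastive term: matching views are positives, the rest negatives. ---
    a = F.normalize(emb_a, dim=1)
    b = F.normalize(emb_b, dim=1)
    logits = a @ b.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    contrastive_loss = F.cross_entropy(logits, targets)

    return pseudo_loss + contrastive_loss

loss = semi_supervised_loss(torch.randn(8, 10), torch.randn(8, 10),
                            torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```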
Pooled Motion Features for First-Person Videos
In this paper, we present a new feature representation for first-person videos.
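The representation pools per-frame motion descriptors over time with a set of simple time-series operators. A hedged NumPy sketch of such pooling (sum, max, and accumulated positive/negative temporal gradients; the exact operators and descriptors in the paper differ):

```python
import numpy as np

def pooled_time_series(frame_feats):
    """Pool a (T, D) sequence of per-frame descriptors into one fixed-length vector."""
    frame_feats = np.asarray(frame_feats, dtype=np.float64)
    sum_pool = frame_feats.sum(axis=0)
    max_pool = frame_feats.max(axis=0)

    # Temporal-change pooling: how much each dimension increases/decreases over time.
    diff = np.diff(frame_feats, axis=0)              # (T-1, D)
    pos_grad = np.clip(diff, 0, None).sum(axis=0)    # accumulated increases
    neg_grad = np.clip(-diff, 0, None).sum(axis=0)   # accumulated decreases

    return np.concatenate([sum_pool, max_pool, pos_grad, neg_grad])

feats = np.random.rand(30, 64)                       # 30 frames, 64-d descriptors
print(pooled_time_series(feats).shape)               # (256,)
```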
Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors
Visual features are of vital importance for human action understanding in videos.
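Trajectory-pooled deep-convolutional descriptors sample convolutional feature maps along tracked point trajectories and pool the sampled activations. A simplified sketch with nearest-neighbor sampling (the paper additionally normalizes the feature maps and obtains the tracks from improved dense trajectories):

```python
import numpy as np

def trajectory_pooled_descriptor(feature_maps, trajectory, stride=8):
    """Pool conv features along one trajectory.

    feature_maps: (T, C, H, W) array of per-frame conv feature maps.
    trajectory: list of (t, x, y) points in image coordinates.
    stride: downsampling factor between image and feature-map coordinates.
    """
    T, C, H, W = feature_maps.shape
    samples = []
    for t, x, y in trajectory:
        # Nearest-neighbor lookup in feature-map coordinates.
        fx = min(int(round(x / stride)), W - 1)
        fy = min(int(round(y / stride)), H - 1)
        samples.append(feature_maps[t, :, fy, fx])
    return np.stack(samples).sum(axis=0)             # sum-pool over the trajectory

maps = np.random.rand(15, 512, 14, 14)               # 15 frames of conv features
traj = [(t, 40.0 + 2 * t, 60.0) for t in range(15)]  # a toy 15-frame track
print(trajectory_pooled_descriptor(maps, traj).shape)  # (512,)
```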
Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters
In this paper, we newly introduce the concept of temporal attention filters, and describe how they can be used for human activity recognition from videos.
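Each temporal attention filter can be parameterized by a learnable center, stride, and width that place a small bank of Gaussian kernels along the frame axis, producing a weighted temporal pooling of per-frame features. A minimal sketch of one such filter; the parameterization details here are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class TemporalAttentionFilter(nn.Module):
    """Learnable Gaussian attention over the time axis of per-frame features."""
    def __init__(self, num_gaussians=4):
        super().__init__()
        self.center = nn.Parameter(torch.tensor(0.0))      # center in [-1, 1]
        self.log_stride = nn.Parameter(torch.tensor(0.0))
        self.log_sigma = nn.Parameter(torch.tensor(0.0))
        self.num_gaussians = num_gaussians

    def forward(self, feats):                  # feats: (B, T, D)
        B, T, D = feats.shape
        n = self.num_gaussians
        # Gaussian centers spread around the learned center with a learned stride.
        offsets = torch.arange(n, device=feats.device) - (n - 1) / 2.0
        centers = (self.center + 1) * (T - 1) / 2 + offsets * self.log_stride.exp()
        sigma = self.log_sigma.exp()
        t = torch.arange(T, device=feats.device, dtype=feats.dtype)
        # Attention weights: (n, T), one normalized Gaussian per row.
        att = torch.exp(-((t[None, :] - centers[:, None]) ** 2) / (2 * sigma ** 2))
        att = att / att.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return torch.einsum("nt,btd->bnd", att, feats)      # (B, n, D)

x = torch.randn(2, 32, 512)                   # 32 frames of 512-d features
print(TemporalAttentionFilter()(x).shape)     # torch.Size([2, 4, 512])
```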
Convolutional Spiking Neural Networks for Spatio-Temporal Feature Extraction
Spiking neural networks (SNNs) can be used in low-power and embedded systems (such as emerging neuromorphic chips) due to their event-based nature.
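The event-based behavior comes from neurons that integrate inputs into a membrane potential and emit binary spikes when a threshold is crossed. A minimal sketch of a convolutional leaky integrate-and-fire (LIF) layer unrolled over time steps; surrogate-gradient training, which SNNs typically need for backpropagation, is omitted.

```python
import torch
import torch.nn as nn

class ConvLIF(nn.Module):
    """Convolution followed by leaky integrate-and-fire neurons, unrolled in time."""
    def __init__(self, in_ch, out_ch, threshold=1.0, decay=0.9):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.threshold = threshold
        self.decay = decay

    def forward(self, x):                      # x: (T, B, C, H, W) input frames
        mem = torch.zeros_like(self.conv(x[0]))
        spikes = []
        for t in range(x.shape[0]):
            mem = self.decay * mem + self.conv(x[t])     # leaky integration
            spike = (mem >= self.threshold).float()      # fire on threshold crossing
            mem = mem - spike * self.threshold           # soft reset after a spike
            spikes.append(spike)
        return torch.stack(spikes)             # binary spike trains, (T, B, C', H, W)

frames = torch.rand(10, 2, 1, 28, 28)          # 10 time steps of 28x28 inputs
print(ConvLIF(1, 8)(frames).shape)             # torch.Size([10, 2, 8, 28, 28])
```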
TorMentor: Deterministic dynamic-path, data augmentations with fractals
We propose the use of fractals as a means of efficient data augmentation.
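One simple way to use fractal patterns for augmentation is to generate a multi-octave, plasma-like noise mask and blend it into the image. The sketch below is an illustrative stand-in rather than TorMentor's implementation; it sums upsampled random grids at several scales.

```python
import numpy as np

def fractal_noise(size, octaves=5, persistence=0.5, rng=None):
    """Plasma-like noise in [0, 1]: sum of upsampled random grids at growing scales."""
    rng = rng or np.random.default_rng()
    noise = np.zeros((size, size))
    amplitude = 1.0
    for o in range(octaves):
        cells = 2 ** (o + 1)
        coarse = rng.random((cells, cells))
        # Nearest-neighbor upsample the coarse grid to the full resolution.
        idx = np.arange(size) * cells // size
        noise += amplitude * coarse[np.ix_(idx, idx)]
        amplitude *= persistence
    noise -= noise.min()
    return noise / noise.max()

def fractal_augment(image, strength=0.4, rng=None):
    """Blend a fractal-noise mask into an HxWxC image with values in [0, 1]."""
    mask = fractal_noise(image.shape[0], rng=rng)[..., None]
    return np.clip((1 - strength) * image + strength * mask, 0.0, 1.0)

img = np.random.rand(224, 224, 3)
print(fractal_augment(img).shape)              # (224, 224, 3)
```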
Dual-path Adaptation from Image to Video Transformers
In this paper, we efficiently transfer the strong representation power of vision foundation models such as ViT and Swin to video understanding with only a few trainable parameters.
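Parameter-efficient transfer of this kind typically freezes the image transformer and trains only small bottleneck adapters around its blocks. A generic sketch of one such adapter; the paper's dual spatial/temporal path design is more involved than this.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck trained while the backbone block stays frozen."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        hidden = dim // reduction
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)         # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                      # x: (B, N_tokens, dim)
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """Frozen transformer block followed by a trainable adapter."""
    def __init__(self, block, dim):
        super().__init__()
        self.block = block
        for p in self.block.parameters():      # only the adapter receives gradients
            p.requires_grad = False
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x):
        return self.adapter(self.block(x))

tokens = torch.randn(2, 197, 768)              # e.g. a ViT-B token sequence
block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
print(AdaptedBlock(block, 768)(tokens).shape)  # torch.Size([2, 197, 768])
```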