Few Shot Action Recognition
25 papers with code • 4 benchmarks • 5 datasets
Few-shot (FS) action recognition is a challenging computer vision problem in which the task is to classify an unlabelled query video into one of the action categories of a support set that contains only a limited number of samples per action class.
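To make the episodic setup concrete, below is a minimal sketch (not tied to any particular paper listed here) of classifying a query video by comparing its embedding to class prototypes averaged over the support set; the pre-computed video embeddings are assumed to be given.

```python
import numpy as np

def classify_query(query_feat, support_feats, support_labels):
    """Nearest-prototype classification for one few-shot episode.

    query_feat:     (D,) embedding of the unlabelled query video
    support_feats:  (N*K, D) embeddings of the support videos
    support_labels: list of N*K class ids (K shots per class)
    """
    classes = sorted(set(support_labels))
    # Prototype = mean embedding of the K support videos of each class.
    prototypes = np.stack([
        support_feats[[i for i, y in enumerate(support_labels) if y == c]].mean(axis=0)
        for c in classes
    ])
    # Assign the query to the class with the closest prototype.
    dists = np.linalg.norm(prototypes - query_feat, axis=1)
    return classes[int(np.argmin(dists))]
```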
Most implemented papers
Temporal-Relational CrossTransformers for Few-Shot Action Recognition
We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set.
Action Genome: Actions as Composition of Spatio-temporal Scene Graphs
Next, by decomposing and learning the temporal changes in visual relationships that result in an action, we demonstrate the utility of a hierarchical event decomposition by enabling few-shot action recognition, achieving 42.7% mAP using as few as 10 examples.
Few-shot Action Recognition with Permutation-invariant Attention
Such encoded blocks are aggregated by permutation-invariant pooling to make our approach robust to varying action lengths and long-range temporal dependencies whose patterns are unlikely to repeat even in clips of the same class.
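As a generic illustration of what permutation-invariant pooling means here (not the paper's attention mechanism), mean pooling over temporal blocks yields the same clip descriptor for any ordering and any number of blocks:

```python
import numpy as np

def pool_blocks(block_feats):
    """Permutation-invariant pooling over encoded temporal blocks.

    block_feats: (num_blocks, D) array; num_blocks may differ per video.
    Mean pooling is invariant to the order of the blocks and produces a
    fixed-size descriptor regardless of the clip's length.
    """
    return block_feats.mean(axis=0)
```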
Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition
Humans can easily recognize actions from only a few examples, while existing video recognition models still rely heavily on large-scale labelled data.
Few-shot Action Recognition with Prototype-centered Attentive Learning
Extensive experiments on four standard few-shot action benchmarks show that our method clearly outperforms previous state-of-the-art methods, with the improvement particularly significant (10+%) on the most challenging fine-grained action recognition benchmark.
Home Action Genome: Cooperative Compositional Action Understanding
However, there remains a lack of studies that extend action composition and leverage multiple viewpoints and multiple modalities of data for representation learning.
TA2N: Two-Stage Action Alignment Network for Few-shot Action Recognition
The first stage locates the action by learning a temporal affine transform, which warps each video feature to its action duration while dismissing the action-irrelevant feature (e.g. background).
A New Split for Evaluating True Zero-Shot Action Recognition
We benchmark several recent approaches on the proposed True Zero-Shot (TruZe) split for UCF101 and HMDB51, with zero-shot and generalized zero-shot evaluation.
Temporal Alignment Prediction for Supervised Representation Learning and Few-Shot Sequence Classification
Explainable distances for sequence data depend on temporal alignment to tackle sequences with different lengths and local variances.
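As a hedged sketch of one such alignment-based distance (classic dynamic time warping, shown only for illustration and not the prediction model proposed in the paper), the cumulative cost matrix lets sequences of different lengths be compared:

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two feature sequences.

    x: (Tx, D) array, y: (Ty, D) array; Tx and Ty may differ.
    """
    tx, ty = len(x), len(y)
    cost = np.full((tx + 1, ty + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, tx + 1):
        for j in range(1, ty + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])  # local frame distance
            # Best of insertion, deletion, or match from the previous cells.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[tx, ty]
```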
Object-Region Video Transformers
In this work, we present Object-Region Video Transformers (ORViT), an \emph{object-centric} approach that extends video transformer layers with a block that directly incorporates object representations.