Self-Supervised Action Recognition
34 papers with code • 6 benchmarks • 5 datasets
Most implemented papers
Contrastive Multiview Coding
We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics.
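A minimal sketch of the pairwise multi-view contrastive objective this describes, assuming each view has already been encoded into an (N, D) embedding; the function name, temperature, and full-softmax InfoNCE form (in place of the paper's NCE approximation with a memory bank) are illustrative assumptions, not the authors' code:

```python
import itertools

import torch
import torch.nn.functional as F

def multiview_contrastive_loss(view_embeddings, temperature=0.07):
    # view_embeddings: list of (N, D) tensors, one per view of the same
    # N samples (e.g. luminance/chrominance, RGB/depth, RGB/flow).
    # Every ordered pair of views contributes one InfoNCE term, so adding
    # a view adds contrastive terms rather than changing the architecture.
    total = 0.0
    pairs = list(itertools.permutations(view_embeddings, 2))
    for z_a, z_b in pairs:
        z_a = F.normalize(z_a, dim=1)
        z_b = F.normalize(z_b, dim=1)
        logits = z_a @ z_b.t() / temperature          # (N, N) cosine similarities
        targets = torch.arange(z_a.size(0), device=z_a.device)
        total = total + F.cross_entropy(logits, targets)
    return total / len(pairs)
```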
Spatiotemporal Contrastive Video Representation Learning
Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away.
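The pull/push behaviour described here is commonly implemented as an InfoNCE loss over a batch of clip pairs; below is a minimal sketch under that reading, with the clip sampler, temperature, and function names as illustrative assumptions rather than the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def sample_two_clips(video, clip_len=16):
    # video: (T, C, H, W) with T >= clip_len. Draw two random temporal crops
    # of the same video; per-clip spatial/colour augmentation would be
    # applied afterwards in practice.
    starts = torch.randint(0, video.size(0) - clip_len + 1, (2,))
    s0, s1 = int(starts[0]), int(starts[1])
    return video[s0:s0 + clip_len], video[s1:s1 + clip_len]

def clip_contrastive_loss(z1, z2, temperature=0.1):
    # z1, z2: (N, D) embeddings of the two clips from each of N videos.
    # Row i of z1 and row i of z2 form a positive pair; all other rows in
    # the batch act as negatives that get pushed away.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                # positives on the diagonal
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetrize so each clip serves as the anchor once.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```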
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets.
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks.
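Read as masked feature distillation, the student reconstructs the teacher's features at masked token positions; the sketch below is an illustrative version of such a loss, where the shapes, the smooth-L1 choice, and the target normalization are assumptions rather than the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def masked_feature_distillation_loss(student_pred, teacher_feat, mask):
    # student_pred: (N, L, D) student predictions for all L spacetime tokens.
    # teacher_feat: (N, L, D) features from a frozen image or video teacher.
    # mask:         (N, L) boolean, True where the token was masked out.
    # Only masked tokens contribute, so the student must infer the teacher's
    # representation of spatiotemporal content it never saw.
    teacher_feat = F.layer_norm(teacher_feat, teacher_feat.shape[-1:])
    loss = F.smooth_l1_loss(student_pred, teacher_feat, reduction="none").mean(dim=-1)
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```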
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework
With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations.
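The inter-intra naming suggests positives taken from another view of the same clip and additional intra-sample negatives built from the clip itself (for example, a temporally shuffled copy); a hedged sketch along those lines, with names and the shuffling choice as assumptions rather than the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def inter_intra_contrastive_loss(anchor, positive, intra_negative, temperature=0.1):
    # anchor:         (N, D) embedding of a clip in one view (e.g. RGB).
    # positive:       (N, D) embedding of the same clip in another view.
    # intra_negative: (N, D) embedding of a broken version of the same clip
    #                 (e.g. its frames shuffled), used as a hard negative.
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    intra_negative = F.normalize(intra_negative, dim=1)
    inter_logits = anchor @ positive.t() / temperature            # (N, N), diag = positives
    intra_logits = (anchor * intra_negative).sum(dim=1, keepdim=True) / temperature  # (N, 1)
    logits = torch.cat([inter_logits, intra_logits], dim=1)       # extra negative column
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)
```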
A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
We present a large-scale study on unsupervised spatiotemporal representation learning from videos.
Masked Motion Encoding for Self-Supervised Video Representation Learning
The latest attempts seek to learn a representation model by predicting the appearance content of the masked regions.
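The masked-appearance objective referred to here reconstructs the pixels of masked spacetime patches; a minimal sketch of that baseline loss follows (patch size, patchification layout, and names are illustrative, and the paper itself argues for predicting masked motion rather than appearance alone):

```python
import torch

def patchify_video(video, patch=(2, 16, 16)):
    # video: (N, C, T, H, W) with T, H, W divisible by the patch sizes.
    # Returns (N, L, C * pt * ph * pw) flattened spacetime patches.
    pt, ph, pw = patch
    N, C, T, H, W = video.shape
    x = video.reshape(N, C, T // pt, pt, H // ph, ph, W // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)             # (N, t, h, w, C, pt, ph, pw)
    return x.reshape(N, -1, C * pt * ph * pw)

def masked_appearance_loss(pred, video, mask, patch=(2, 16, 16)):
    # pred: (N, L, P) predicted pixels per patch; mask: (N, L) True if masked.
    # Reconstruction error is measured only on the masked patches.
    target = patchify_video(video, patch)
    loss = ((pred - target) ** 2).mean(dim=-1)
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```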
Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning
A good data representation should capture the relations between instances, i.e., semantic similarity and dissimilarity, which contrastive learning undermines by treating all negatives as noise.
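One way to keep such inter-instance relations is to soften the contrastive target by mixing the one-hot positive with a similarity distribution estimated from another view; the sketch below illustrates that idea, with the mixing weight, temperatures, and names as assumptions rather than the paper's exact estimator:

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(z_online, z_target, temperature=0.1,
                          target_temperature=0.07, lam=0.5):
    # z_online, z_target: (N, D) embeddings of two views of the same N samples.
    z_online = F.normalize(z_online, dim=1)
    z_target = F.normalize(z_target, dim=1).detach()
    logits = z_online @ z_target.t() / temperature
    # Relational target: similarities within the target view, with the
    # self-similarity masked out, turned into a distribution.
    sim = z_target @ z_target.t() / target_temperature
    sim.fill_diagonal_(float("-inf"))
    relational = F.softmax(sim, dim=1)
    one_hot = torch.eye(z_online.size(0), device=z_online.device)
    soft_targets = lam * one_hot + (1.0 - lam) * relational
    # Cross-entropy against the soft target distribution.
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```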
Unsupervised Representation Learning by Sorting Sequences
We present an unsupervised representation learning approach using videos without semantic labels.
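The title suggests a pretext task of recovering the temporal order of shuffled frames, which can be cast as classifying the permutation applied to a sampled frame tuple; a minimal, illustrative sketch under that assumption (tuple length, permutation set, and the concatenation head are not taken from the paper):

```python
import itertools
import random

import torch
import torch.nn as nn

class OrderPredictionHead(nn.Module):
    # Classifies which of the 4! = 24 permutations was applied to a
    # 4-frame tuple, given per-frame features from a shared backbone.
    def __init__(self, feat_dim, n_frames=4):
        super().__init__()
        self.perms = list(itertools.permutations(range(n_frames)))
        self.classifier = nn.Linear(feat_dim * n_frames, len(self.perms))

    def shuffle(self, frames):
        # frames: (n_frames, C, H, W) in correct order; returns the shuffled
        # frames and the index of the permutation used as the training label.
        label = random.randrange(len(self.perms))
        perm = torch.tensor(self.perms[label])
        return frames[perm], label

    def forward(self, frame_feats):
        # frame_feats: (N, n_frames, feat_dim) features of the shuffled frames.
        return self.classifier(frame_feats.flatten(1))
```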
Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
We conduct extensive experiments with C3D to validate the effectiveness of our proposed approach.