Zero-Shot Action Recognition
34 papers with code • 7 benchmarks • 6 datasets
Benchmarks
These leaderboards are used to track progress in Zero-Shot Action Recognition.
Libraries
Use these libraries to find Zero-Shot Action Recognition models and implementations.
Most implemented papers
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
In this study, we focus on transferring knowledge for video classification tasks.
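The common recipe behind this kind of transfer is to reuse a pretrained image-text encoder as a zero-shot video classifier: frames are embedded, pooled over time, and matched against text embeddings of the class names. Below is a minimal sketch of that pipeline; it assumes OpenAI's `clip` package, placeholder frame paths, and a toy label set, all of which stand in for whatever backbone and video loader is actually used.

```python
import torch
import clip  # assumes the openai/CLIP package; any CLIP-style encoder works similarly
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical inputs: target class names and decoded video frames (placeholder paths).
class_names = ["archery", "playing guitar", "riding a horse"]
frames = [Image.open(f"frame_{i}.jpg") for i in range(8)]

# Encode class names with a simple prompt template.
text_tokens = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_feats = model.encode_text(text_tokens)
    frame_batch = torch.stack([preprocess(f) for f in frames]).to(device)
    frame_feats = model.encode_image(frame_batch)

# Mean-pool frame embeddings into one video embedding, then score by cosine similarity.
video_feat = frame_feats.mean(dim=0, keepdim=True)
video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

scores = (video_feat @ text_feats.T).softmax(dim=-1)
print(class_names[scores.argmax().item()])
```

Because the class "weights" are just text embeddings, swapping in unseen action labels requires no retraining, which is what makes the zero-shot setting possible.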
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition.
Learning a Deep Embedding Model for Zero-Shot Learning
In this paper we argue that the key to make deep ZSL models succeed is to choose the right embedding space.
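The paper's central claim is that using the visual feature space, rather than the semantic (attribute or word-vector) space, as the embedding space mitigates the hubness problem. The sketch below uses invented dimensions and layer sizes to show the general pattern: a small network projects class semantic vectors into the visual feature space, and a test sample is labelled by its nearest projected class prototype.

```python
import torch
import torch.nn as nn

# Invented dimensions: 300-d semantic vectors (e.g. word embeddings), 2048-d visual features.
SEM_DIM, VIS_DIM = 300, 2048

# Map class semantics into the *visual* feature space (the embedding space argued for).
semantic_to_visual = nn.Sequential(
    nn.Linear(SEM_DIM, 1024),
    nn.ReLU(),
    nn.Linear(1024, VIS_DIM),
    nn.ReLU(),
)

def zero_shot_predict(visual_feat, class_semantics):
    """Assign the unseen class whose projected prototype is closest to the visual feature."""
    prototypes = semantic_to_visual(class_semantics)            # (num_classes, VIS_DIM)
    dists = torch.cdist(visual_feat.unsqueeze(0), prototypes)   # (1, num_classes)
    return dists.argmin(dim=-1)

# Toy usage with random tensors standing in for real features.
visual_feat = torch.randn(VIS_DIM)
unseen_class_semantics = torch.randn(5, SEM_DIM)
print(zero_shot_predict(visual_feat, unseen_class_semantics))
```

In practice the projection network is trained on seen classes (e.g. with a regression loss between projected class embeddings and the visual features of their samples) and then applied unchanged to unseen classes.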
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
We thus propose VIDAL-10M, a dataset pairing Video, Infrared, Depth, and Audio with their corresponding Language.
EVA-CLIP: Improved Training Techniques for CLIP at Scale
Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters at significantly lower training cost.
Evaluation of Output Embeddings for Fine-Grained Image Classification
Image classification has advanced significantly in recent years with the availability of large-scale image sets.
Label-Embedding for Image Classification
Attributes act as intermediate representations that enable parameter sharing between classes, a must when training data is scarce.
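In this label-embedding family of methods, each class is described by an attribute vector, and a bilinear compatibility function scores how well an image matches a class through those shared attributes; unseen classes only need new attribute vectors, not new training data. A rough sketch with invented dimensions:

```python
import torch

# Invented sizes: 2048-d image features, 85-d class attribute vectors.
FEAT_DIM, ATTR_DIM = 2048, 85

# Bilinear compatibility: score(x, y) = x^T W a_y, where a_y is the attribute vector of class y.
W = torch.randn(FEAT_DIM, ATTR_DIM, requires_grad=True)  # learned on seen classes

def compatibility(image_feat, class_attributes):
    # image_feat: (FEAT_DIM,), class_attributes: (num_classes, ATTR_DIM)
    return image_feat @ W @ class_attributes.T  # (num_classes,) scores

# At test time, unseen classes are ranked purely from their attribute descriptions.
image_feat = torch.randn(FEAT_DIM)
unseen_attrs = torch.randn(10, ATTR_DIM)
print(compatibility(image_feat, unseen_attrs).argmax())
```

The single matrix W is shared across all classes, which is exactly the parameter sharing the excerpt refers to: scarce (or zero) training data for a class is compensated by its attribute description.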
ActionCLIP: A New Paradigm for Video Action Recognition
Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune".
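The "prompt" step of this paradigm can be illustrated in isolation: label words are wrapped in several textual templates and the resulting text embeddings are averaged into one classifier weight per action. The templates and encoder call below are illustrative placeholders, not the paper's exact prompt set.

```python
import torch
import clip  # assumes the openai/CLIP package as the text encoder

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Illustrative prompt templates (not the paper's exact set).
templates = [
    "a video of a person {}.",
    "a person is {}.",
    "footage of someone {}.",
]

def build_classifier(action_labels):
    """Average text embeddings over templates to get one weight vector per action."""
    weights = []
    with torch.no_grad():
        for label in action_labels:
            tokens = clip.tokenize([t.format(label) for t in templates]).to(device)
            emb = model.encode_text(tokens)
            emb = emb / emb.norm(dim=-1, keepdim=True)
            weights.append(emb.mean(dim=0))
    return torch.stack(weights)  # (num_actions, embed_dim)

classifier = build_classifier(["brushing hair", "juggling balls", "surfing"])
```

Video embeddings are then scored against these prompt-derived weights, so new action labels can be added simply by writing new label text.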
Bridging Video-text Retrieval with Multiple Choice Questions
As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks
Large-scale pretrained image-text models have shown incredible zero-shot performance in a handful of tasks, including video ones such as action recognition and text-to-video retrieval.