Video Captioning
162 papers with code • 11 benchmarks • 32 datasets
Video Captioning is the task of automatically generating a caption for a video by understanding the actions and events in it, which also enables efficient retrieval of the video through text.
Source: NITS-VC System for VATEX Video Captioning Challenge 2020
Libraries
Use these libraries to find Video Captioning models and implementations.
Most implemented papers
Top-down Visual Saliency Guided by Captions
Neural image/video captioning models can generate accurate descriptions, but their internal process of mapping regions to words is a black box and therefore difficult to explain.
ECO: Efficient Convolutional Network for Online Video Understanding
In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.
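The core idea is to embed videos and query sentences into a shared space and rank videos by their similarity to the text. Below is a minimal sketch of such a joint embedding, assuming pre-extracted features and made-up dimensions; it is not the Frozen in Time architecture.

```python
# Minimal sketch of a joint text-video embedding for retrieval
# (assumed encoders and dimensions, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=768, text_dim=512, embed_dim=256):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        # Cosine similarity matrix: rows are text queries, columns are videos.
        return t @ v.T

# Text-to-video retrieval: rank videos by similarity to the query embedding.
model = JointEmbedding()
video_feats = torch.randn(100, 768)   # pooled features for 100 videos
text_feats = torch.randn(1, 512)      # one query sentence
scores = model(video_feats, text_feats)
top5 = scores.squeeze(0).topk(5).indices
```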
What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment
Can performance on the task of action quality assessment (AQA) be improved by exploiting a description of the action and its quality?
Multi-modal Dense Video Captioning
We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside video frames and the corresponding audio track.
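As a rough illustration of treating ASR text as an extra input stream, the sketch below concatenates projected video, audio, and ASR features into one context vector per time step; the dimensions, the simple concatenation fusion, and the module names are assumptions, not the paper's exact design.

```python
# Illustrative fusion of ASR text, video-frame, and audio features
# before a caption decoder (all shapes are assumptions).
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, video_dim=1024, audio_dim=128, asr_dim=512, hidden=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.asr_proj = nn.Linear(asr_dim, hidden)
        self.fuse = nn.Linear(3 * hidden, hidden)

    def forward(self, video, audio, asr_text):
        # Each input: (batch, time, dim), temporally aligned to the same event.
        fused = torch.cat([self.video_proj(video),
                           self.audio_proj(audio),
                           self.asr_proj(asr_text)], dim=-1)
        return torch.relu(self.fuse(fused))  # context fed to a caption decoder

fusion = MultiModalFusion()
ctx = fusion(torch.randn(2, 20, 1024),   # video features
             torch.randn(2, 20, 128),    # audio features
             torch.randn(2, 20, 512))    # ASR text features
```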
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs.
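A compact way to see the common baseline these methods build on is the symmetric CLIP-style contrastive (InfoNCE) loss over a batch of matched video-text pairs; this is a sketch of that baseline, not the paper's expectation-maximization variant.

```python
# CLIP-style symmetric contrastive loss over matched video-text pairs
# (a sketch of the general objective, not this paper's method).
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    # L2-normalise so the dot product is cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0))         # matched pairs lie on the diagonal
    # Symmetric cross-entropy: video-to-text and text-to-video directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```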
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences.
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.
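One way to picture the modularized design is a shared "universal" block applied on top of disentangled modality-specific blocks; the sketch below uses generic Transformer layers with assumed names and shapes and is not mPLUG-2's actual API.

```python
# Rough sketch of a modularized multi-modal model: modality-specific
# modules stay disentangled, a universal module is shared across them
# (names and shapes are assumptions).
import torch
import torch.nn as nn

class ModularModel(nn.Module):
    def __init__(self, dim=512, nhead=8):
        super().__init__()
        # Disentangled modality-specific modules.
        self.text_module = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.visual_module = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        # Shared universal module applied to every modality.
        self.universal_module = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)

    def forward(self, text_tokens, visual_tokens):
        t = self.universal_module(self.text_module(text_tokens))
        v = self.universal_module(self.visual_module(visual_tokens))
        return t, v

model = ModularModel()
t, v = model(torch.randn(2, 12, 512), torch.randn(2, 32, 512))
```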
Reconstruction Network for Video Captioning
Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning.
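The encoder-decoder-reconstructor idea can be sketched as a captioning loss (forward flow) plus a loss for reconstructing the video features from the decoder's hidden states (backward flow); the module internals, per-frame word targets, and loss weighting below are simplifying assumptions, not RecNet's exact formulation.

```python
# Sketch of forward (video -> sentence) and backward (sentence -> video)
# flows trained jointly; heavily simplified relative to RecNet.
import torch
import torch.nn as nn

class RecNetSketch(nn.Module):
    def __init__(self, feat_dim=512, hidden=512, vocab=10000):
        super().__init__()
        # Simplification: the decoder reads frame features directly instead of
        # conditioning on previously generated words.
        self.decoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.word_head = nn.Linear(hidden, vocab)
        # Reconstructor maps decoder hidden states back to video features.
        self.reconstructor = nn.LSTM(hidden, feat_dim, batch_first=True)

    def forward(self, video_feats):
        dec_states, _ = self.decoder(video_feats)
        word_logits = self.word_head(dec_states)         # forward flow
        recon_feats, _ = self.reconstructor(dec_states)  # backward flow
        return word_logits, recon_feats

model = RecNetSketch()
feats = torch.randn(2, 16, 512)                   # 16 frame features per clip
logits, recon = model(feats)
caption_loss = nn.functional.cross_entropy(
    logits.reshape(-1, 10000), torch.randint(10000, (2 * 16,)))
recon_loss = nn.functional.mse_loss(recon, feats)
total_loss = caption_loss + 0.2 * recon_loss      # assumed weighting
```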
VideoBERT: A Joint Model for Video and Language Representation Learning
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube.