Video Captioning
162 papers with code • 11 benchmarks • 32 datasets
Video Captioning is the task of automatically generating a caption for a video by understanding the actions and events in it, which also enables efficient retrieval of the video through text.
Source: NITS-VC System for VATEX Video Captioning Challenge 2020
Libraries
Use these libraries to find Video Captioning models and implementations.
Most implemented papers
Top-down Visual Saliency Guided by Captions
Neural image/video captioning models can generate accurate descriptions, but their internal process of mapping regions to words is a black box and therefore difficult to explain.
ECO: Efficient Convolutional Network for Online Video Understanding
In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.
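The core idea is to embed videos and query sentences into a shared space and rank videos by their similarity to the text. Below is a minimal sketch of such a joint embedding, assuming pre-extracted features and made-up dimensions; it is not the Frozen in Time architecture.

```python
# Minimal sketch of a joint text-video embedding for retrieval
# (assumed encoders and dimensions, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=768, text_dim=512, embed_dim=256):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        # Cosine similarity matrix: rows are text queries, columns are videos.
        return t @ v.T

# Text-to-video retrieval: rank videos by similarity to the query embedding.
model = JointEmbedding()
video_feats = torch.randn(100, 768)   # pooled features for 100 videos
text_feats = torch.randn(1, 512)      # one query sentence
scores = model(video_feats, text_feats)
top5 = scores.squeeze(0).topk(5).indices
```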
What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment
Can performance on the task of action quality assessment (AQA) be improved by exploiting a description of the action and its quality?
Multi-modal Dense Video Captioning
We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside video frames and the corresponding audio track.
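As a rough illustration of treating ASR text as an extra input stream, the sketch below concatenates projected video, audio, and ASR features into one context vector per time step; the dimensions, the simple concatenation fusion, and the module names are assumptions, not the paper's exact design.

```python
# Illustrative fusion of ASR text, video-frame, and audio features
# before a caption decoder (all shapes are assumptions).
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, video_dim=1024, audio_dim=128, asr_dim=512, hidden=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.asr_proj = nn.Linear(asr_dim, hidden)
        self.fuse = nn.Linear(3 * hidden, hidden)

    def forward(self, video, audio, asr_text):
        # Each input: (batch, time, dim), temporally aligned to the same event.
        fused = torch.cat([self.video_proj(video),
                           self.audio_proj(audio),
                           self.asr_proj(asr_text)], dim=-1)
        return torch.relu(self.fuse(fused))  # context fed to a caption decoder

fusion = MultiModalFusion()
ctx = fusion(torch.randn(2, 20, 1024),   # video features
             torch.randn(2, 20, 128),    # audio features
             torch.randn(2, 20, 512))    # ASR text features
```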
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs.
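A compact way to see the common baseline these methods build on is the symmetric CLIP-style contrastive (InfoNCE) loss over a batch of matched video-text pairs; this is a sketch of that baseline, not the paper's expectation-maximization variant.

```python
# CLIP-style symmetric contrastive loss over matched video-text pairs
# (a sketch of the general objective, not this paper's method).
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    # L2-normalise so the dot product is cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0))         # matched pairs lie on the diagonal
    # Symmetric cross-entropy: video-to-text and text-to-video directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```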
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences.
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.
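One way to picture the modularized design is a shared "universal" block applied on top of disentangled modality-specific blocks; the sketch below uses generic Transformer layers with assumed names and shapes and is not mPLUG-2's actual API.

```python
# Rough sketch of a modularized multi-modal model: modality-specific
# modules stay disentangled, a universal module is shared across them
# (names and shapes are assumptions).
import torch
import torch.nn as nn

class ModularModel(nn.Module):
    def __init__(self, dim=512, nhead=8):
        super().__init__()
        # Disentangled modality-specific modules.
        self.text_module = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.visual_module = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        # Shared universal module applied to every modality.
        self.universal_module = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)

    def forward(self, text_tokens, visual_tokens):
        t = self.universal_module(self.text_module(text_tokens))
        v = self.universal_module(self.visual_module(visual_tokens))
        return t, v

model = ModularModel()
t, v = model(torch.randn(2, 12, 512), torch.randn(2, 32, 512))
```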
Reconstruction Network for Video Captioning
Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning.
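The encoder-decoder-reconstructor idea can be sketched as a captioning loss (forward flow) plus a loss for reconstructing the video features from the decoder's hidden states (backward flow); the module internals, per-frame word targets, and loss weighting below are simplifying assumptions, not RecNet's exact formulation.

```python
# Sketch of forward (video -> sentence) and backward (sentence -> video)
# flows trained jointly; heavily simplified relative to RecNet.
import torch
import torch.nn as nn

class RecNetSketch(nn.Module):
    def __init__(self, feat_dim=512, hidden=512, vocab=10000):
        super().__init__()
        # Simplification: the decoder reads frame features directly instead of
        # conditioning on previously generated words.
        self.decoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.word_head = nn.Linear(hidden, vocab)
        # Reconstructor maps decoder hidden states back to video features.
        self.reconstructor = nn.LSTM(hidden, feat_dim, batch_first=True)

    def forward(self, video_feats):
        dec_states, _ = self.decoder(video_feats)
        word_logits = self.word_head(dec_states)         # forward flow
        recon_feats, _ = self.reconstructor(dec_states)  # backward flow
        return word_logits, recon_feats

model = RecNetSketch()
feats = torch.randn(2, 16, 512)                   # 16 frame features per clip
logits, recon = model(feats)
caption_loss = nn.functional.cross_entropy(
    logits.reshape(-1, 10000), torch.randint(10000, (2 * 16,)))
recon_loss = nn.functional.mse_loss(recon, feats)
total_loss = caption_loss + 0.2 * recon_loss      # assumed weighting
```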
VideoBERT: A Joint Model for Video and Language Representation Learning
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube.