Dense Video Captioning

24 papers with code • 4 benchmarks • 7 datasets

Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing events in a video.
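As a rough illustration of what "detecting and describing" produces, the sketch below represents each event as a time span with a caption and compares predicted spans to a reference span using temporal intersection-over-union (tIoU). The `Event` class, the `temporal_iou` helper, and all timestamps are hypothetical and chosen only for illustration; they are not taken from any of the benchmarks or papers listed here.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One detected event: a time span in seconds plus its caption (hypothetical structure)."""
    start: float
    end: float
    caption: str


def temporal_iou(a: Event, b: Event) -> float:
    """Temporal IoU: overlap duration divided by the duration covered by either span."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


# Made-up predictions for the piano example; events may overlap in time.
predictions = [
    Event(0.0, 12.0, "a man plays a piano"),
    Event(8.0, 20.0, "another man dances"),
    Event(18.0, 25.0, "a crowd claps"),
]

# Made-up reference annotation for one event.
reference = Event(0.0, 10.0, "a man is playing the piano")

# Pick the prediction whose time span best matches the reference.
best = max(predictions, key=lambda e: temporal_iou(e, reference))
print(best.caption, round(temporal_iou(best, reference), 3))
```

In this toy setup a prediction would typically count as a match only when its tIoU with a reference exceeds some chosen threshold, after which the two captions can be compared with a text-similarity metric.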

Benchmarks

These leaderboards are used to track progress in dense video captioning.

Dataset	Best Model
ActivityNet Captions	Vid2Seq
YouCook2	Vid2Seq (HowTo100M+VidChapters-7M PT)
ViTT	Vid2Seq (VidChapters-7M PT)
VidChapters-7M	Vid2Seq

Subtasks

Zero-shot dense video captioning

Most implemented papers

Multi-modal Dense Video Captioning

v-iashin/MDVC • 17 Mar 2020

We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside video frames and the corresponding audio track.

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

google-research/scenic • CVPR 2023

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.

A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

v-iashin/BMT • 17 May 2020

We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task.

End-to-End Dense Video Captioning with Parallel Decoding

ttengwang/pdvc • ICCV 2021

Dense video captioning aims to generate multiple associated captions with their temporal locations from the video.

SoccerNet 2023 Challenges Results

lRomul/ball-action-spotting • 12 Sep 2023

More information on the tasks, challenges, and leaderboards is available at https://www.soccer-net.org.

Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval

ailab-kyunghee/cm2_dvc • 11 Apr 2024

Research on dense video captioning, which aims to automatically localize and caption all events within an untrimmed video, has attracted significant attention.

Towards Automatic Learning of Procedures from Web Instructional Videos

LuoweiZhou/ProcNets-YouCook2 • 28 Mar 2017

To answer this question, we introduce the problem of procedure segmentation--to segment a video procedure into category-independent procedure segments.

Joint Event Detection and Description in Continuous Video Streams

VisionLearningGroup/JEDDi-Net • 28 Feb 2018

In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context.
