Dense Video Captioning
24 papers with code • 4 benchmarks • 7 datasets
Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing events in a video.
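As a minimal sketch of what "detecting and describing" produces, the output of a dense video captioning system can be pictured as a list of time spans, each with its own caption, where the spans may overlap. The `Event` class, the `temporal_iou` helper, and all timestamps below are hypothetical and for illustration only; they are not taken from any of the papers listed on this page. Temporal intersection-over-union is a common way to compare a predicted span with a reference span.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One detected event: a time span in seconds plus its caption."""
    start: float
    end: float
    caption: str


def temporal_iou(a: Event, b: Event) -> float:
    """Intersection-over-union of two time spans (0.0 when they are disjoint)."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical output for the piano video; the timestamps are invented.
predicted = [
    Event(0.0, 60.0, "a man playing a piano"),
    Event(10.0, 30.0, "another man dancing"),
    Event(50.0, 60.0, "a crowd clapping"),
]

if __name__ == "__main__":
    for event in predicted:
        print(f"[{event.start:5.1f}s - {event.end:5.1f}s] {event.caption}")
```

Note that the first event spans the whole clip while the other two sit inside it; overlapping events of this kind are what distinguish dense captioning from producing a single caption per video.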
Most implemented papers
Multi-modal Dense Video Captioning
We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside video frames and the corresponding audio track.
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer
We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task.
End-to-End Dense Video Captioning with Parallel Decoding
Dense video captioning aims to generate multiple associated captions with their temporal locations from the video.
SoccerNet 2023 Challenges Results
More information on the tasks, challenges, and leaderboards is available at https://www.soccer-net.org.
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
There has been significant attention to the research on dense video captioning, which aims to automatically localize and caption all events within untrimmed video.
Towards Automatic Learning of Procedures from Web Instructional Videos
To answer this question, we introduce the problem of procedure segmentation: segmenting a video procedure into category-independent procedure segments.
Joint Event Detection and Description in Continuous Video Streams
In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context.
Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions.
End-to-End Dense Video Captioning with Masked Transformer
To address this problem, we propose an end-to-end transformer model for dense video captioning.