Video Grounding
41 papers with code • 2 benchmarks • 8 datasets
Video grounding is the task of linking natural language descriptions to specific video segments. Given a video and a natural language description, such as a sentence or a caption, the model must identify the segment of the video that corresponds to the description. This can involve localizing the objects or actions mentioned in the description within the video, or associating a specific time interval with the description.
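A common way to score temporal grounding predictions is temporal Intersection-over-Union (tIoU) between a predicted (start, end) interval and the annotated one; the sketch below is a minimal, illustrative version (the function name and the 0.5 threshold are not tied to any specific benchmark).

```python
def temporal_iou(pred, gt):
    """tIoU between two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# Example: the model localizes a query to seconds 12.0-18.5 of the video,
# while the annotated segment is 11.0-17.0.
print(temporal_iou((12.0, 18.5), (11.0, 17.0)))  # ~0.67, counted as correct at tIoU >= 0.5
```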
Most implemented papers
Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding
From the perspective of temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) that directly models the similarity between language queries and video moments in a joint embedding space.
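A minimal sketch of this metric-learning view (not the actual MMN code): queries and candidate moments are embedded into a shared space, and grounding reduces to ranking moments by similarity to the query. The encoders, dimensions, and candidate count below are placeholders.

```python
import torch
import torch.nn.functional as F

dim = 256
query_emb = torch.randn(1, dim)      # stand-in for an encoded language query
moment_embs = torch.randn(50, dim)   # stand-in for 50 encoded candidate moments

# Cosine similarity in the joint space; the highest-scoring moment is the prediction.
scores = F.cosine_similarity(query_emb, moment_embs, dim=-1)  # shape: (50,)
best_moment = scores.argmax().item()
```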
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos
The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos.
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG).
Dense Regression Network for Video Grounding
The key idea of this paper is to use the distances between each frame within the ground-truth segment and the starting (ending) frame as dense supervision to improve video grounding accuracy.
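A minimal sketch of this dense-supervision idea (not the paper's code): every frame that falls inside the ground-truth segment regresses its distance to the segment's start and end frames. The frame count and boundaries below are illustrative.

```python
def dense_regression_targets(num_frames, gt_start, gt_end):
    """For each in-segment frame t, the targets are (t - gt_start, gt_end - t)."""
    targets = []
    for t in range(num_frames):
        if gt_start <= t <= gt_end:
            targets.append((t - gt_start, gt_end - t))  # supervised frame
        else:
            targets.append(None)                        # no regression target
    return targets

# Frames 20..24 of a 30-frame clip lie inside the ground-truth segment.
print(dense_regression_targets(30, 20, 24)[22])  # (2, 2)
```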
Human-centric Spatio-Temporal Video Grounding With Visual Transformers
HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization.
VLG-Net: Video-Language Graph Matching Network for Video Grounding
Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query.
Cross-Modal learning for Audio-Visual Video Parsing
In this paper, we present a novel approach to the audio-visual video parsing (AVVP) task that demarcates events from a video separately for audio and visual modalities.
Interventional Video Grounding with Dual Contrastive Learning
We introduce a dual contrastive learning approach (DCL) to better align the text and video by maximizing the mutual information (MI) between the query and video clips, and the MI between the start/end frames of a target moment and the other frames within a video, to learn more informative visual representations.
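A minimal sketch of the mutual-information objective mentioned above, using the common InfoNCE lower bound (the paper's exact formulation may differ). Matching query/clip pairs sit on the diagonal of the similarity matrix; all other pairs in the batch serve as negatives.

```python
import torch
import torch.nn.functional as F

batch, dim, temperature = 8, 256, 0.07
query_embs = F.normalize(torch.randn(batch, dim), dim=-1)  # placeholder query features
clip_embs = F.normalize(torch.randn(batch, dim), dim=-1)   # placeholder clip features

logits = query_embs @ clip_embs.t() / temperature  # (batch, batch) similarity matrix
labels = torch.arange(batch)                        # i-th query matches i-th clip
info_nce_loss = F.cross_entropy(logits, labels)     # maximizes a lower bound on MI
```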
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
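A minimal sketch of one common form of teacher-to-student transfer (the paper's actual distillation objectives differ): the student is trained to match the teacher's softened output distribution via a KL-divergence term. The logits, vocabulary size, and temperature below are placeholders.

```python
import torch
import torch.nn.functional as F

temperature = 2.0
teacher_logits = torch.randn(4, 30522)                       # stand-in for teacher outputs
student_logits = torch.randn(4, 30522, requires_grad=True)   # stand-in for student LM outputs

kd_loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature ** 2
kd_loss.backward()  # gradients flow only into the student
```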