Moment Retrieval
48 papers with code • 2 benchmarks • 5 datasets
Moment retrieval can be defined as the task of "localizing moments in a video given a user query".
Description from: QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
Image credit: QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
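At its simplest, the task can be cast as scoring each video clip against a query embedding and returning the highest-scoring contiguous span. The sketch below is illustrative only: the cosine-similarity scorer and the function name are assumptions standing in for the learned cross-modal models used by the papers listed here.

```python
import torch
import torch.nn.functional as F

def retrieve_moment(clip_feats: torch.Tensor,
                    query_feat: torch.Tensor,
                    max_span: int = 10) -> tuple[int, int]:
    """Return (start, end) clip indices of the best-matching moment.

    clip_feats: (num_clips, dim) per-clip video features.
    query_feat: (dim,) sentence-level query feature.
    Scoring here is plain cosine similarity; learned models replace this.
    """
    # Per-clip relevance scores in [-1, 1].
    scores = F.cosine_similarity(clip_feats, query_feat.unsqueeze(0), dim=-1)

    best, best_span = float("-inf"), (0, 0)
    n = clip_feats.size(0)
    for start in range(n):
        for end in range(start + 1, min(start + max_span, n) + 1):
            # Mean score over the span favors uniformly relevant moments.
            span_score = scores[start:end].mean().item()
            if span_score > best:
                best, best_span = span_score, (start, end)
    return best_span

# Toy usage: 20 clips of 512-d features, one query.
clips = torch.randn(20, 512)
query = torch.randn(512)
start, end = retrieve_moment(clips, query)
print(f"predicted moment: clips [{start}, {end})")
```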
Libraries
Use these libraries to find Moment Retrieval models and implementations.

Most implemented papers
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
Each video in the dataset is annotated with: (1) a human-written free-form NL query, and (2) relevant moments in the video w.r.t. the query.
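A single annotation might look like the record below. The field names are illustrative, loosely following the structure described in the paper rather than the exact released schema.

```python
# One QVHighlights-style annotation record (field names are illustrative).
example = {
    "vid": "video_0001",                                # video identifier
    "duration": 150.0,                                  # video length in seconds
    "query": "A man is cooking pasta in the kitchen.",  # free-form NL query
    "relevant_windows": [[24.0, 42.0], [88.0, 96.0]],   # moments w.r.t. the query
}
```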
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
We present HERO, a novel framework for large-scale video+language omni-representation learning.
Finding Moments in Video Collections Using Natural Language
We evaluate our approach on two recently proposed datasets for temporal localization of moments in video with natural language (DiDeMo and Charades-STA) extended to our video corpus moment retrieval setting.
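In the video corpus setting, a system must first decide which videos are worth searching and then localize a moment inside them. A minimal two-stage sketch follows, assuming precomputed clip and query embeddings; the coarse video ranking by best single clip is one standard choice, not this paper's exact method.

```python
import torch
import torch.nn.functional as F

def corpus_moment_retrieval(video_feats: list[torch.Tensor],
                            query_feat: torch.Tensor,
                            top_k: int = 5):
    """Two-stage moment retrieval over a corpus of videos.

    video_feats: list of (num_clips_i, dim) tensors, one per video.
    query_feat:  (dim,) query embedding.
    """
    # Stage 1: coarse ranking -- score each video by its best single clip.
    video_scores = torch.stack([
        F.cosine_similarity(feats, query_feat.unsqueeze(0), dim=-1).max()
        for feats in video_feats
    ])
    candidates = video_scores.topk(min(top_k, len(video_feats))).indices

    # Stage 2: fine localization inside each candidate video.
    results = []
    for vid_idx in candidates.tolist():
        scores = F.cosine_similarity(
            video_feats[vid_idx], query_feat.unsqueeze(0), dim=-1)
        start = int(scores.argmax())  # trivially: a one-clip moment
        results.append((vid_idx, start, start + 1, float(scores.max())))
    # Highest-scoring (video, start, end, score) tuples across the corpus.
    return sorted(results, key=lambda r: r[-1], reverse=True)

corpus = [torch.randn(25, 256) for _ in range(8)]
print(corpus_moment_retrieval(corpus, torch.randn(256))[0])
```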
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
The queries are also labeled with query types that indicate whether each of them is more related to the video, the subtitle, or both, allowing for in-depth analysis of the dataset and the methods built on top of it.
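These type labels make it straightforward to slice a dataset or an evaluation by modality. A small hypothetical example (the annotations and field names below are invented for illustration):

```python
from collections import defaultdict

# Hypothetical TVR-style annotations; "type" marks whether a query is
# grounded in the video, the subtitle, or both.
queries = [
    {"query": "He pours coffee into a mug.", "type": "video"},
    {"query": "She says she is moving to Paris.", "type": "subtitle"},
    {"query": "He apologizes while handing over flowers.", "type": "both"},
]

by_type = defaultdict(list)
for q in queries:
    by_type[q["type"]].append(q["query"])

for qtype, qs in by_type.items():
    print(f"{qtype}: {len(qs)} queries")
```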
Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding
Dummy tokens conditioned on the text query take a portion of the attention weights, preventing irrelevant video clips from being represented by the text query.
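A minimal sketch of the dummy-token idea: clips cross-attend to the text tokens plus a few learnable dummy tokens conditioned on a pooled query representation, so clips unrelated to the query can place their attention mass on the dummies instead of the query words. Shapes, module names, and the mean-pool conditioning are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DummyTokenCrossAttention(nn.Module):
    """Clips attend over [text tokens ; query-conditioned dummy tokens]."""

    def __init__(self, dim: int = 256, num_dummies: int = 3, num_heads: int = 4):
        super().__init__()
        self.dummies = nn.Parameter(torch.randn(num_dummies, dim))
        # Conditions the dummy tokens on a pooled query representation.
        self.condition = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, clip_feats, text_feats):
        # clip_feats: (B, num_clips, dim); text_feats: (B, num_words, dim)
        pooled = text_feats.mean(dim=1, keepdim=True)           # (B, 1, dim)
        dummies = self.dummies.unsqueeze(0) + self.condition(pooled)
        kv = torch.cat([text_feats, dummies], dim=1)            # extend keys/values
        out, attn_weights = self.attn(clip_feats, kv, kv)
        # attn_weights[..., -num_dummies:] is the mass absorbed by dummies:
        # clips unrelated to the query can "park" their attention there.
        return out, attn_weights

# Toy usage.
m = DummyTokenCrossAttention()
clips, words = torch.randn(2, 30, 256), torch.randn(2, 12, 256)
out, w = m(clips, words)
print(out.shape, w.shape)  # torch.Size([2, 30, 256]) torch.Size([2, 30, 15])
```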
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Weakly Supervised Video Moment Retrieval From Text Queries
The supervision is weak because, during training, we only have access to video-text pairs rather than the temporal extent of the video to which different text descriptions relate.
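With only video-text pairs, a common weakly-supervised recipe is a video-level contrastive objective: align each video's best-matching clip with its own description and push it away from other videos' descriptions, so localization emerges without temporal labels. The MIL-style max-pooled sketch below is one standard formulation, not this specific paper's loss.

```python
import torch
import torch.nn.functional as F

def weak_alignment_loss(clip_feats, text_feats, temperature: float = 0.07):
    """MIL-style contrastive loss from video-text pairs alone.

    clip_feats: (B, num_clips, dim) -- clips of B videos.
    text_feats: (B, dim)            -- matching sentence embeddings.
    No temporal labels are used: video-text similarity is the score of the
    single best clip, so localization emerges as a by-product of training.
    """
    clips = F.normalize(clip_feats, dim=-1)
    texts = F.normalize(text_feats, dim=-1)
    # (B_video, B_text, num_clips): every clip vs. every sentence.
    sims = torch.einsum("bnd,td->btn", clips, texts)
    video_text = sims.max(dim=-1).values / temperature   # (B, B) logits
    labels = torch.arange(video_text.size(0))
    # Symmetric InfoNCE: match videos to texts and texts to videos.
    return 0.5 * (F.cross_entropy(video_text, labels) +
                  F.cross_entropy(video_text.t(), labels))

loss = weak_alignment_loss(torch.randn(4, 32, 256), torch.randn(4, 256))
print(float(loss))
```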
Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos
Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to the given natural language query.
Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos
Existing weakly-supervised methods thus fail to distinguish the target moment from plausible negative moments.
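One way to force that distinction is an intra-video contrast: treat the top-scoring proposal as the positive and the runner-up proposals from the same video as hard ("plausible") negatives. The margin loss below is a hedged sketch in the spirit of two-branch designs, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def intra_video_contrast(proposal_scores: torch.Tensor,
                         margin: float = 0.2) -> torch.Tensor:
    """Margin loss pushing the best proposal above plausible negatives.

    proposal_scores: (num_proposals,) query-matching scores for one video.
    The runner-up proposals are exactly the "plausible negative moments"
    a purely inter-video objective would never penalize.
    """
    top2 = proposal_scores.topk(2).values
    positive, hard_negative = top2[0], top2[1]
    return F.relu(margin - (positive - hard_negative))

scores = torch.tensor([0.9, 0.85, 0.3, 0.1])
print(float(intra_video_contrast(scores)))  # 0.15: the near-tie is penalized
```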
VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval
This paper explores methods for performing VMR in a weakly-supervised manner (wVMR): training is performed without temporal moment labels but only with the text query that describes a segment of the video.