Temporal Sentence Grounding
11 papers with code • 1 benchmark • 1 dataset
Temporal sentence grounding (TSG) aims to locate a specific moment in an untrimmed video given a natural language query. Different levels of supervision are used for this task: 1) weak supervision: only the video-level action category set; 2) semi-weak supervision: the video-level action category set plus action annotations at a few timestamps; 3) full supervision: action category and action interval annotations for all actions in the untrimmed video.
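As a minimal, paper-agnostic sketch of the fully supervised setting, a prediction is a temporal interval (start, end) in seconds, and it is typically scored against the annotated interval with temporal IoU (as in the common Recall@K, IoU=m metrics). The query text and timestamps below are made-up examples.

```python
# Toy TSG evaluation: compare a predicted interval to the ground-truth interval.
def temporal_iou(pred, gt):
    """Temporal intersection-over-union between two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# Example: query "the person opens the fridge" grounded in a 30 s video.
print(temporal_iou(pred=(4.2, 9.8), gt=(5.0, 10.0)))  # ~0.83
```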
Most implemented papers
Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding
Viewing temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) that directly models the similarity between language queries and video moments in a joint embedding space.
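A minimal sketch of this metric-learning view: moment and query features are projected into a shared space and scored by cosine similarity. The feature dimensions and projection layers below are illustrative assumptions, not MMN's actual architecture.

```python
import torch
import torch.nn.functional as F

moment_feats = torch.randn(32, 512)   # 32 candidate moments, 512-d video features (assumed)
query_feats = torch.randn(4, 300)     # 4 sentence queries, 300-d text features (assumed)

proj_v = torch.nn.Linear(512, 256)    # video -> joint embedding space
proj_q = torch.nn.Linear(300, 256)    # text  -> joint embedding space

v = F.normalize(proj_v(moment_feats), dim=-1)
q = F.normalize(proj_q(query_feats), dim=-1)

sim = q @ v.t()                       # (4, 32) query-moment similarity matrix
best_moment = sim.argmax(dim=-1)      # highest-scoring moment per query
```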
Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos
Temporal sentence grounding in videos aims to detect and localize one target video segment, which semantically corresponds to a given sentence.
Uncovering Hidden Challenges in Query-Based Video Moment Retrieval
In this paper, we present a series of experiments assessing how well the benchmark results reflect the true progress in solving the moment retrieval task.
Context-aware Biaffine Localizing Network for Temporal Sentence Grounding
This paper addresses the problem of temporal sentence grounding (TSG), which aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
Weakly Supervised Temporal Sentence Grounding With Gaussian-Based Contrastive Proposal Learning
Existing weakly supervised methods train their models to distinguish positive visual-language pairs from negatives randomly collected from other videos, ignoring the highly confusing video segments within the same video.
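A hedged sketch of the contrast described here: scoring a positive moment against negatives drawn from the same video (hard negatives) rather than only from other videos. The InfoNCE-style loss and temperature are generic choices for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

query = F.normalize(torch.randn(1, 256), dim=-1)        # sentence embedding (assumed size)
pos = F.normalize(torch.randn(1, 256), dim=-1)          # matched proposal embedding
intra_negs = F.normalize(torch.randn(8, 256), dim=-1)   # other segments of the same video

# Positive is placed first, so the target class index is 0.
logits = query @ torch.cat([pos, intra_negs]).t() / 0.07
loss = F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
```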
D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation
Under this setup, we propose a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), which consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA).
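A minimal illustration of the glance-annotation setup this framework builds on: only a single timestamp inside the target moment is labeled, and a Gaussian prior centered on that glance weights nearby clips as likely foreground. The clip count and `sigma` below are assumed hyperparameters, not D3G's actual values.

```python
import numpy as np

num_clips = 64          # video divided into 64 uniform clips (assumed)
glance_clip = 22        # the single annotated timestamp falls in clip 22
sigma = 5.0             # prior width (assumed)

t = np.arange(num_clips)
prior = np.exp(-0.5 * ((t - glance_clip) / sigma) ** 2)
prior /= prior.sum()    # normalized prior over clips, peaked at the glance
```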
Temporal Sentence Grounding in Streaming Videos
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
Learning Temporal Sentence Grounding From Narrated EgoVideos
Compared to traditional benchmarks on which this task is evaluated, narrated egocentric video datasets offer finer-grained sentences to ground in notably longer videos.
BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos
Center-based moment formulations, however, suffer from center misalignment caused by the inherent ambiguity of moment centers, leading to inaccurate predictions.
Gaussian Mixture Proposals with Pull-Push Learning Scheme to Capture Diverse Events for Weakly Supervised Temporal Video Grounding
In weakly supervised temporal video grounding, previous methods use predetermined single-Gaussian proposals, which lack the ability to express the diverse events described by the sentence query.
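A hedged sketch of a Gaussian-mixture proposal: instead of one Gaussian, a proposal is a weighted sum of several Gaussians over the temporal axis, so it can cover multiple sub-events mentioned in the query. Component counts and parameters here are illustrative, not the paper's learned values.

```python
import numpy as np

def gaussian_mixture_mask(num_clips, centers, widths, weights):
    """Soft temporal mask over clips formed by a mixture of Gaussians."""
    t = np.arange(num_clips)[None, :]                      # (1, T)
    c = np.asarray(centers)[:, None]                       # (K, 1)
    s = np.asarray(widths)[:, None]
    w = np.asarray(weights)[:, None]
    mask = (w * np.exp(-0.5 * ((t - c) / s) ** 2)).sum(0)  # (T,)
    return mask / mask.max()

# Two sub-events, e.g. "picks up the cup and then drinks".
mask = gaussian_mixture_mask(64, centers=[20, 40], widths=[4, 6], weights=[0.5, 0.5])
```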