Referring Video Object Segmentation
29 papers with code • 2 benchmarks • 2 datasets
Referring video object segmentation aims to segment an object in a video given a natural language expression. Unlike conventional video object segmentation, the task exploits a different type of supervision, language expressions, to identify and segment the object referred to by the given expression in the video.
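For orientation, here is a minimal sketch of the task interface, assuming PyTorch; the ToyRVOS module, its layers, and all sizes are illustrative stand-ins, not any listed paper's method. It fuses a pooled sentence embedding with per-frame visual features to produce one mask logit map per frame.

```python
import torch
import torch.nn as nn

class ToyRVOS(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)             # word embeddings
        self.visual = nn.Conv2d(3, dim, kernel_size=3, padding=1)   # per-frame features
        self.head = nn.Conv2d(1, 1, kernel_size=1)                  # mask logits

    def forward(self, frames, tokens):
        # frames: (T, 3, H, W) video clip; tokens: (L,) word ids of the expression
        text = self.text_embed(tokens).mean(dim=0)      # (dim,) pooled sentence vector
        feats = self.visual(frames)                     # (T, dim, H, W)
        # similarity between the sentence vector and every pixel, per frame
        sim = torch.einsum("d,tdhw->thw", text, feats).unsqueeze(1)  # (T, 1, H, W)
        return self.head(sim)                           # per-frame mask logits

model = ToyRVOS()
masks = model(torch.randn(4, 3, 32, 32), torch.randint(0, 1000, (7,)))
print(masks.shape)  # torch.Size([4, 1, 32, 32])
```

Real systems replace the toy encoders with pretrained language and video backbones, but the input/output contract (video plus expression in, per-frame masks out) is the same.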
Most implemented papers
End-to-End Referring Video Object Segmentation with Multimodal Transformers
Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it.
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
We evaluate our unified models on various benchmarks.
Cross-Modal Self-Attention Network for Referring Image Segmentation
This module controls the information flow of features at different levels.
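A hedged sketch of such level-wise gating, assuming PyTorch; GatedLevelFusion and its sigmoid gates are illustrative, not CMSA's exact formulation. A learned gate per feature level scales how much that level contributes to the fused representation.

```python
import torch
import torch.nn as nn

class GatedLevelFusion(nn.Module):
    def __init__(self, dim=64, num_levels=3):
        super().__init__()
        # one 1x1 conv per level produces a spatial gate for that level
        self.gates = nn.ModuleList(nn.Conv2d(dim, 1, 1) for _ in range(num_levels))

    def forward(self, levels):
        # levels: list of (B, dim, H, W) feature maps, already resized to one scale
        fused = 0
        for feat, gate in zip(levels, self.gates):
            fused = fused + torch.sigmoid(gate(feat)) * feat  # gate controls the flow
        return fused

fusion = GatedLevelFusion()
x = [torch.randn(2, 64, 16, 16) for _ in range(3)]
print(fusion(x).shape)  # torch.Size([2, 64, 16, 16])
```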
Language as Queries for Referring Video Object Segmentation
Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the target object referred to by a language expression in all video frames.
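A minimal sketch of the language-as-queries idea, assuming PyTorch; the query count, decoder configuration, and the segment helper below are illustrative rather than ReferFormer's actual design. Language-conditioned object queries attend to flattened frame features in a standard transformer decoder, and each query yields a mask by dot product with the pixel embeddings.

```python
import torch
import torch.nn as nn

dim, num_queries = 64, 5
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
queries = nn.Parameter(torch.randn(num_queries, dim))  # learnable object queries

def segment(frame_feats, sentence_feat):
    # frame_feats: (T, HW, dim) flattened visual tokens; sentence_feat: (dim,)
    T = frame_feats.shape[0]
    # condition every query on the language feature, repeated for each frame
    q = (queries + sentence_feat).unsqueeze(0).expand(T, -1, -1)   # (T, Q, dim)
    out = decoder(q, frame_feats)                                  # (T, Q, dim)
    # each query produces a mask by dot product with the pixel embeddings
    return torch.einsum("tqd,tnd->tqn", out, frame_feats)          # (T, Q, HW) logits

logits = segment(torch.randn(3, 16 * 16, dim), torch.randn(dim))
print(logits.shape)  # torch.Size([3, 5, 256])
```

Because the same language-conditioned queries are applied to every frame, the query identity itself links the object across time, which is what makes this formulation attractive for video.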
Local-Global Context Aware Transformer for Language-Guided Video Segmentation
We explore the task of language-guided video segmentation (LVS).
Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation
Referring video object segmentation aims to predict foreground labels for objects referred by natural language expressions in videos.
Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus
Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression.
Multi-Attention Network for Compressed Video Referring Object Segmentation
To address this problem, we propose a multi-attention network which consists of a dual-path dual-attention module and a query-based cross-modal Transformer module.
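A brief sketch of the query-based cross-modal attention step, assuming PyTorch; the shapes and tensors are illustrative, not the paper's exact modules. Expression tokens act as queries that cross-attend to flattened visual tokens, so each word gathers the spatial evidence it describes.

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
text = torch.randn(1, 7, 64)      # (B, L, dim) expression tokens as queries
visual = torch.randn(1, 256, 64)  # (B, HW, dim) flattened frame features
attended, weights = cross_attn(query=text, key=visual, value=visual)
print(attended.shape, weights.shape)  # torch.Size([1, 7, 64]) torch.Size([1, 7, 256])
```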
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation
We propose a Vision-Language Transformer (VLT) framework for referring segmentation to facilitate deep interactions among multi-modal information and enhance holistic understanding of vision-language features.
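A hedged sketch of language-conditioned query generation in the spirit of VLT, assuming PyTorch; QueryGenerator and its per-query projections are hypothetical stand-ins. Each query re-weights the words of the expression using the visual context, producing several "views" of the same sentence.

```python
import torch
import torch.nn as nn

class QueryGenerator(nn.Module):
    def __init__(self, dim=64, num_queries=4):
        super().__init__()
        # one projection per query lets each query emphasize different words
        self.proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_queries))

    def forward(self, word_feats, visual_feat):
        # word_feats: (L, dim) word features; visual_feat: (dim,) pooled frame feature
        queries = []
        for proj in self.proj:
            scores = torch.softmax(proj(word_feats) @ visual_feat, dim=0)  # (L,)
            queries.append(scores @ word_feats)  # vision-weighted word pooling
        return torch.stack(queries)  # (num_queries, dim)

gen = QueryGenerator()
print(gen(torch.randn(7, 64), torch.randn(64)).shape)  # torch.Size([4, 64])
```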
1st Place Solution for YouTubeVOS Challenge 2022: Referring Video Object Segmentation
The task of referring video object segmentation aims to segment, in the frames of a given video, the object to which the referring expression refers.