Referring Expression Segmentation
68 papers with code • 25 benchmarks • 11 datasets
The task is to label the pixels of an image or video that belong to an object instance referred to by a linguistic expression. The referring expression (RE) must unambiguously identify an individual object in the discourse or scene (the referent).
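To make the input/output contract concrete, here is a minimal Python sketch. The `segment_referent` function is a hypothetical placeholder (not from any of the papers below): it takes an image and an expression and returns a boolean mask over pixels.

```python
import numpy as np

def segment_referent(image: np.ndarray, expression: str) -> np.ndarray:
    """Return a boolean mask of shape (H, W) marking the referent's pixels.

    Trivial placeholder: a real model fuses visual and linguistic features;
    this stub simply returns an empty mask of the right shape.
    """
    return np.zeros(image.shape[:2], dtype=bool)

image = np.zeros((480, 640, 3), dtype=np.uint8)        # dummy RGB frame
mask = segment_referent(image, "the dog on the left")  # (480, 640) boolean mask
```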
Most implemented papers
Segmentation from Natural Language Expressions
To produce pixelwise segmentation for the language expression, we propose an end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information.
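The core idea is to encode the expression with a recurrent network, tile the resulting sentence embedding across the CNN feature map, and classify each spatial location. Below is a hedged PyTorch sketch of that concat-and-classify fusion; the layer sizes and the toy convolutional backbone are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConcatFusionSegmenter(nn.Module):
    """Tile the sentence embedding over the visual feature map, then classify."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, vis_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Stand-in for a pretrained CNN backbone (the paper uses a stronger one).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, vis_dim, kernel_size=3, stride=4, padding=1),
            nn.ReLU(),
        )
        self.classifier = nn.Conv2d(vis_dim + hidden_dim, 1, kernel_size=1)

    def forward(self, image, tokens):
        feats = self.backbone(image)                    # (B, C, h, w)
        _, (h_n, _) = self.lstm(self.embed(tokens))     # final hidden state
        lang = h_n[-1][:, :, None, None].expand(-1, -1, *feats.shape[2:])
        fused = torch.cat([feats, lang], dim=1)         # spatial tiling + concat
        return self.classifier(fused)                   # low-resolution mask logits

model = ConcatFusionSegmenter()
logits = model(torch.randn(1, 3, 128, 128), torch.randint(0, 1000, (1, 8)))
```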
Image Segmentation Using Text and Image Prompts
After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query.
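CLIPSeg checkpoints are distributed through the Hugging Face `transformers` library, so a text-prompted segmentation map can be produced in a few lines. The sketch below assumes the `CIDAS/clipseg-rd64-refined` checkpoint and a local `example.jpg` placeholder; the 0.5 cutoff for binarizing the sigmoid heatmap is an arbitrary choice.

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("example.jpg")                 # placeholder path
prompts = ["the red chair", "a dog lying on the sofa"]
inputs = processor(text=prompts, images=[image] * len(prompts),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits               # one low-res heatmap per prompt
masks = torch.sigmoid(logits) > 0.5               # arbitrary 0.5 threshold
```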
CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions
Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process.
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation
Given a linguistic phrase and a video, the task of video object segmentation with referring expressions (language-guided VOS) is to generate binary masks for the object to which the phrase refers.
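A naive baseline for this output format applies a single-image referring model frame by frame, as in the sketch below. Here `segment_referent` is again a hypothetical stub; real methods additionally exploit temporal context across frames.

```python
import numpy as np

def segment_referent(frame: np.ndarray, expression: str) -> np.ndarray:
    # Hypothetical single-image model; returns an empty mask here.
    return np.zeros(frame.shape[:2], dtype=bool)

def segment_video(frames, expression):
    # Frame-by-frame baseline; real methods also model temporal consistency.
    return [segment_referent(f, expression) for f in frames]

video = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(4)]
masks = segment_video(video, "the person riding a bike")  # one mask per frame
```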
SynthRef: Generation of Synthetic Referring Expressions for Object Segmentation
Recent advances in deep learning have brought significant progress in visual grounding tasks such as language-guided video object segmentation.
End-to-End Referring Video Object Segmentation with Multimodal Transformers
Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it.
Unleashing Text-to-Image Diffusion Models for Visual Perception
In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
GRES: Generalized Referring Expression Segmentation
Existing classic RES datasets and methods commonly support single-target expressions only, i.e., one expression refers to one target object.
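In the generalized setting, an expression may instead refer to several objects or to none, so the ground-truth target becomes the union of zero or more instance masks. The sketch below illustrates that label construction; the data layout (`instance_masks`, `referred_ids`) is an assumption for illustration, not the dataset's actual format.

```python
import numpy as np

def build_target_mask(instance_masks, referred_ids, shape):
    """Union the masks of all referred instances; an empty id list => no target."""
    target = np.zeros(shape, dtype=bool)
    for i in referred_ids:
        target |= instance_masks[i]
    return target

h, w = 4, 4
instances = [np.eye(h, w, dtype=bool),
             np.flip(np.eye(h, w, dtype=bool), axis=1)]
both = build_target_mask(instances, [0, 1], (h, w))  # multi-target expression
none = build_target_mask(instances, [], (h, w))      # no-target expression
```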
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
We evaluate our unified models on various benchmarks.