Referring Expression
116 papers with code • 1 benchmark • 3 datasets
Referring expression comprehension localizes the object instance in an image that corresponds to a natural-language description, typically by placing a bounding box around it.
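As a concrete illustration of the input/output contract (image plus expression in, bounding box out), here is a minimal sketch that runs one of the open-set grounding models listed below (Grounding DINO) through its Hugging Face Transformers integration. The checkpoint name, image path, and thresholds are illustrative assumptions, not a reference pipeline.

```python
# Minimal sketch: referring expression comprehension via zero-shot grounding.
# Assumes the Transformers Grounding DINO integration; checkpoint, image path,
# and thresholds below are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"   # assumed public checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("example.jpg")                        # placeholder image
expression = "the person in a red jacket on the left."   # referring expression

inputs = processor(images=image, text=expression, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw predictions into scored boxes for the described instance.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],  # (height, width)
)
print(results[0]["boxes"])  # predicted bounding box(es)
```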
Most implemented papers
UNITER: UNiversal Image-TExt Representation Learning
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
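A toy sketch of that conditional-masking idea (not UNITER's actual code): when one modality is corrupted for a pre-training task, the other is left fully observed. Tensor shapes, the masking rate, and the mask token id are illustrative.

```python
# Toy illustration of conditional masking: mask exactly one modality,
# leaving the other fully observed (shapes and ids are placeholders).
import torch

def conditional_mask(text_ids, region_feats, mask_text, p=0.15, mask_token_id=103):
    """Corrupt a fraction of one modality; return (text_ids, region_feats)."""
    text_ids = text_ids.clone()
    region_feats = region_feats.clone()
    if mask_text:
        # Masked language modeling: replace some text tokens, keep all regions.
        mask = torch.rand(text_ids.shape) < p
        text_ids[mask] = mask_token_id
    else:
        # Masked region modeling: zero out some region features, keep all text.
        mask = torch.rand(region_feats.shape[:2]) < p
        region_feats[mask] = 0.0
    return text_ids, region_feats

# Example: 2 captions of 10 tokens, 2 images with 36 region features of dim 2048.
text_ids = torch.randint(1000, 2000, (2, 10))
region_feats = torch.randn(2, 36, 2048)
masked_text, full_regions = conditional_mask(text_ids, region_feats, mask_text=True)
```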
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion.
Modeling Context in Referring Expressions
Humans refer to objects in their environments all the time, especially in dialogue with other people.
Image Segmentation Using Text and Image Prompts
After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query.
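A minimal sketch of that text-prompted segmentation behavior, assuming the CLIPSeg integration in Hugging Face Transformers and the authors' public checkpoint; the image path, prompt, and threshold are placeholders.

```python
# Minimal sketch of free-text-prompted segmentation with CLIPSeg
# (assumed checkpoint "CIDAS/clipseg-rd64-refined").
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("example.jpg")   # placeholder image path
prompt = "the cup on the left"      # free-text query

inputs = processor(text=[prompt], images=[image], padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # low-resolution segmentation logits

binary_mask = torch.sigmoid(logits) > 0.5  # threshold into a binary map
```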
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.
CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions
Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
SeqTR: A Simple yet Universal Network for Visual Grounding
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES).
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.
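SoM is a prompting strategy rather than a model: the image is first partitioned into regions (e.g., by an off-the-shelf segmentation model), each region is overlaid with a visible mark, and the marked image plus the referring expression are sent to the LMM. A rough sketch of the overlay step, with hypothetical region boxes and helper names, might look like this:

```python
# Rough sketch of the Set-of-Mark overlay step: draw a numeric mark on each
# region so an LMM can refer to regions by number. The region boxes here are
# hypothetical; in practice they come from a segmentation/proposal model.
from PIL import Image, ImageDraw

def overlay_marks(image, boxes):
    """Draw a numbered label at the center of each (x0, y0, x1, y1) box."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        draw.rectangle([cx - 12, cy - 12, cx + 12, cy + 12], fill="white")
        draw.text((cx - 6, cy - 8), str(idx), fill="black")
    return marked

image = Image.open("example.jpg")                   # placeholder image
boxes = [(40, 60, 200, 300), (250, 80, 420, 310)]   # hypothetical regions
marked_image = overlay_marks(image, boxes)
# The marked image and a prompt such as "Which mark is the person holding
# the umbrella?" are then passed to the multimodal model.
```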