Generalized Referring Expression Segmentation
6 papers with code • 1 benchmark • 1 dataset
Generalized Referring Expression Segmentation (GRES), introduced by Liu et al. at CVPR 2023, allows expressions that indicate an arbitrary number of target objects. GRES takes an image and a referring expression as input and requires predicting a mask for the target object(s).
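To make the task interface concrete, here is a minimal sketch (all class and variable names are hypothetical, not taken from any released GRES codebase): a model consumes an image and a tokenized expression and predicts a per-pixel mask, which for GRES may cover one object, several objects, or none.

```python
import torch
import torch.nn as nn

class ToyGRESModel(nn.Module):
    """Hypothetical sketch of a GRES-style model's I/O contract.

    Real systems use far richer vision and language encoders; this
    only illustrates image + expression in, per-pixel mask out.
    """

    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        self.visual = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patch features
        self.text = nn.Embedding(vocab_size, dim)                   # token features
        self.head = nn.Conv2d(dim, 1, kernel_size=1)                # per-pixel logit

    def forward(self, image, token_ids):
        v = self.visual(image)                # (B, D, H/16, W/16)
        t = self.text(token_ids).mean(dim=1)  # (B, D) pooled expression
        fused = v * t[:, :, None, None]       # naive language conditioning
        logits = self.head(fused)             # (B, 1, H/16, W/16)
        # Upsample to full resolution; thresholding the sigmoid gives a mask
        # that may cover zero, one, or many objects.
        return nn.functional.interpolate(logits, size=image.shape[-2:], mode="bilinear")

model = ToyGRESModel()
image = torch.randn(1, 3, 224, 224)
token_ids = torch.randint(0, 30522, (1, 12))
mask = model(image, token_ids).sigmoid() > 0.5  # (1, 1, 224, 224) boolean mask
```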
Most implemented papers
GRES: Generalized Referring Expression Segmentation
Existing classic RES datasets and methods commonly support only single-target expressions, i.e., one expression refers to one target object.
MAttNet: Modular Attention Network for Referring Expression Comprehension
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.
Vision-Language Transformer and Query Generation for Referring Segmentation
We introduce a transformer with multi-head attention to build a network with an encoder-decoder attention mechanism that "queries" the given image with the language expression.
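As a rough illustration of this query mechanism (a simplified reading, not the paper's exact module), standard multi-head cross-attention lets language features act as queries over flattened image features:

```python
import torch
import torch.nn as nn

# Minimal sketch: the language expression "queries" image features via
# multi-head cross-attention (illustrative, not the paper's exact design).
dim, heads = 256, 8
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

image_feats = torch.randn(1, 14 * 14, dim)  # flattened visual tokens (B, HW, D)
lang_feats = torch.randn(1, 12, dim)        # word features (B, L, D)

# Queries come from language, keys/values from the image, so each word
# attends to the image regions it describes.
queried, attn_weights = cross_attn(query=lang_feats, key=image_feats, value=image_feats)
print(queried.shape)       # torch.Size([1, 12, 256])
print(attn_weights.shape)  # torch.Size([1, 12, 196])
```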
CRIS: CLIP-Driven Referring Image Segmentation
In addition, we present text-to-pixel contrastive learning to explicitly pull the text feature toward the related pixel-level features and push it away from irrelevant ones.
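A generic sketch of such a text-to-pixel objective (an assumed formulation for illustration, not necessarily CRIS's exact loss): the pooled text embedding should score high against pixels inside the ground-truth mask and low against pixels outside it.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_contrastive_loss(text_feat, pixel_feats, gt_mask, temperature=0.07):
    """Generic text-to-pixel contrastive loss (illustrative formulation).

    text_feat:   (B, D)        pooled sentence embedding
    pixel_feats: (B, D, H, W)  per-pixel embeddings
    gt_mask:     (B, H, W)     1 inside the referred object(s), 0 outside
    """
    text_feat = F.normalize(text_feat, dim=1)
    pixel_feats = F.normalize(pixel_feats, dim=1)
    # Cosine similarity between the text vector and every pixel vector.
    sim = torch.einsum("bd,bdhw->bhw", text_feat, pixel_feats) / temperature
    # Pixels inside the mask are positives, the rest negatives, so this
    # reduces to a per-pixel binary cross-entropy on the similarity map.
    return F.binary_cross_entropy_with_logits(sim, gt_mask.float())

loss = text_to_pixel_contrastive_loss(
    torch.randn(2, 256), torch.randn(2, 256, 14, 14), torch.randint(0, 2, (2, 14, 14))
)
```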
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image.
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
PSALM is a powerful extension of the Large Multi-modal Model (LMM) that addresses the challenges of segmentation tasks.