Image-text matching
83 papers with code • 1 benchmark • 1 dataset
Libraries
Use these libraries to find Image-text matching models and implementations.

Most implemented papers
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation.
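A minimal sketch of the word-level attention behind this kind of attention-driven refinement: each image sub-region attends over the caption's words to produce a word-context vector that a later generator stage can use. Function names, shapes, and the pooling choice are illustrative assumptions, not the AttnGAN code.

```python
# Illustrative word-level attention for attention-driven refinement (not the AttnGAN codebase).
import torch
import torch.nn.functional as F

def word_context(image_feats, word_embs):
    """image_feats: (B, N, D) sub-region features from the previous stage.
    word_embs:   (B, T, D) word embeddings of the caption.
    Returns a per-region word-context vector (B, N, D) that a later
    stage can combine with image_feats for fine-grained refinement."""
    scores = torch.bmm(image_feats, word_embs.transpose(1, 2))  # (B, N, T) region-word scores
    attn = F.softmax(scores, dim=-1)                            # each region attends over words
    return torch.bmm(attn, word_embs)                           # (B, N, D) word-context vectors

img = torch.randn(2, 64, 256)   # 64 sub-regions, 256-d features
txt = torch.randn(2, 12, 256)   # 12 words
print(word_context(img, txt).shape)  # torch.Size([2, 64, 256])
```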
UNITER: UNiversal Image-TExt Representation Learning
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text).
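A rough sketch of the conditional-masking idea described above: when text tokens are masked, the image regions are left fully observed (and vice versa), rather than masking both modalities at once. The mask-token id, masking rate, and feature shapes are hypothetical; this is not the UNITER implementation.

```python
# Illustrative conditional masking: mask one modality, keep the other fully observed.
import torch

MASK_ID = 103  # hypothetical [MASK] token id

def conditional_mask(text_ids, region_feats, p=0.15, mask_text=True):
    text_ids = text_ids.clone()
    region_feats = region_feats.clone()
    if mask_text:
        m = torch.rand_like(text_ids, dtype=torch.float) < p
        text_ids[m] = MASK_ID                    # masked language modeling, image untouched
    else:
        m = torch.rand(region_feats.shape[:2]) < p
        region_feats[m] = 0.0                    # masked region modeling, text untouched
    return text_ids, region_feats

ids = torch.randint(1000, (2, 16))       # token ids
regs = torch.randn(2, 36, 2048)          # 36 detected regions per image
masked_ids, full_regs = conditional_mask(ids, regs, mask_text=True)
```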
VinVL: Revisiting Visual Representations in Vision-Language Models
In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, Oscar (Li et al., 2020), and utilize an improved approach, VinVL, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
Stacked Cross Attention for Image-Text Matching
Prior work either simply aggregates the similarity of all possible region-word pairs without attending differentially to more and less important words or regions, or uses a multi-step attentional process that captures only a limited number of semantic alignments and is less interpretable.
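A small sketch of cross attention in the text-to-image direction: each word attends over image regions, and the per-word relevance scores are pooled into a single image-sentence similarity. The temperature and mean pooling are assumptions for illustration, not the exact SCAN formulation.

```python
# Illustrative stacked cross attention (text-to-image direction), not the SCAN codebase.
import torch
import torch.nn.functional as F

def cross_attention_similarity(regions, words, temperature=4.0):
    """regions: (N, D) region features; words: (T, D) word features."""
    r = F.normalize(regions, dim=-1)
    w = F.normalize(words, dim=-1)
    sims = w @ r.t()                                   # (T, N) word-region cosine similarities
    attn = F.softmax(temperature * sims, dim=-1)       # each word attends over regions
    attended = attn @ regions                          # (T, D) attended image vectors
    word_scores = F.cosine_similarity(words, attended, dim=-1)  # relevance of each word
    return word_scores.mean()                          # pool into one image-sentence score

score = cross_attention_similarity(torch.randn(36, 1024), torch.randn(12, 1024))
```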
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision.
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.
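For reference, a minimal sketch of the transformer-based multimodal encoder described above: projected region features and word embeddings are concatenated into one token sequence and encoded jointly with self-attention. Dimensions, vocabulary size, and layer counts are hypothetical.

```python
# Illustrative joint multimodal encoder over visual tokens and word tokens.
import torch
import torch.nn as nn

class JointEncoder(nn.Module):
    def __init__(self, region_dim=2048, vocab=30522, d_model=768):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)
        self.word_emb = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, region_feats, token_ids):
        v = self.region_proj(region_feats)         # (B, N, d_model) visual tokens
        t = self.word_emb(token_ids)               # (B, T, d_model) word tokens
        return self.encoder(torch.cat([v, t], 1))  # joint self-attention over both modalities

out = JointEncoder()(torch.randn(2, 36, 2048), torch.randint(30522, (2, 16)))
```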
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).
Dual Attention Networks for Multimodal Reasoning and Matching
We propose Dual Attention Networks (DANs) which jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language.
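A compact sketch of the dual-attention idea: one soft attention over image regions and one over words, driven by a shared query vector, with the two attended summaries compared for matching. The shared memory vector and the cosine comparison are illustrative assumptions, not the DAN architecture.

```python
# Illustrative dual attention for matching: parallel visual and textual attention.
import torch
import torch.nn.functional as F

def attend(features, memory):
    """features: (N, D); memory: (D,). Soft-attention summary of the features."""
    weights = F.softmax(features @ memory, dim=0)   # (N,) attention weights
    return weights @ features                       # (D,) attended summary

def dual_attention_score(regions, words, memory):
    v = attend(regions, memory)   # visual attention summary
    u = attend(words, memory)     # textual attention summary
    return torch.dot(F.normalize(v, dim=0), F.normalize(u, dim=0))

score = dual_attention_score(torch.randn(36, 512), torch.randn(12, 512),
                             torch.randn(512))
```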
Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking
We propose a novel framework that achieves remarkable matching performance with acceptable model complexity.