Image Captioning

615 papers with code • 32 benchmarks • 65 datasets

Image Captioning is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image, and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
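To make the evaluation step more concrete, here is a minimal sketch of a BLEU-style score in Python. It assumes whitespace-tokenized captions, works at sentence level, and applies no smoothing, so it will not reproduce the numbers reported on the leaderboards. The function names (`ngrams`, `clipped_precision`, `simple_bleu`) and the example captions are hypothetical, chosen only for illustration.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram count is clipped by
    the maximum count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def simple_bleu(candidate, references, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty.
    Illustrative only: sentence-level, unsmoothed, uniform weights."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # The brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Hypothetical captions, not taken from any benchmark.
candidate = "a dog runs on the grass".split()
references = [
    "a dog is running on the grass".split(),
    "a brown dog runs across a field".split(),
]
print(simple_bleu(candidate, references))
```

Clipping is what stops a degenerate caption from scoring well: for the candidate "the the the the the the the" against the references "the cat is on the mat" and "there is a cat on the mat", the unigram precision is 2/7 rather than 7/7. CIDEr follows a similar n-gram matching idea but weights n-grams by TF-IDF, which this sketch does not attempt.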

Benchmarks


These leaderboards are used to track progress in Image Captioning.

Dataset	Best Model
COCO Captions	mPLUG
MS COCO	ExpansionNet v2
nocaps-val-in-domain	BLIP-2 ViT-G FlanT5 XL (zero-shot)
nocaps-val-overall	BLIP-2 ViT-G FlanT5 XL (zero-shot)
nocaps in-domain	GIT2, Single Model
nocaps-val-near-domain	BLIP-2 ViT-G FlanT5 XL (zero-shot)
nocaps-val-out-domain	BLIP-2 ViT-G FlanT5 XL (zero-shot)
nocaps near-domain	GIT2, Single Model
nocaps out-of-domain	PaLI
SCICAP	CNN+LSTM (Vision only, First sentence)
nocaps entire	Lyrics
Flickr30k Captions test	Unified VLP
WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images	BLIP2 FlanT5-XXL (Fine-tuned)
nocaps-XD entire	GIT2
nocaps-XD in-domain	GIT2
nocaps-XD near-domain	GIT2
nocaps-XD out-of-domain	GIT2
nocaps val	Prismer
COCO Captions test	From Captions to Visual Concepts and Back
Localized Narratives	LoopCAG
FlickrStyle10K	CapDec
Conceptual Captions	ClipCap (MLP + GPT2 tuning)
BanglaLekhaImageCaptions	CNN + 1D CNN
AIC-ICC	CMCL
MSCOCO	CapDec
IU X-Ray	BiomedGPT
Peir Gross	BiomedGPT
MS-COCO	NeuSyRE
ChEBI-20	GIT-Mol
VizWiz 2020 test-dev	IBM Research AI
VizWiz 2020 test	IBM Research AI
TextCaps 2020	TAP


Libraries

Use these libraries to find Image Captioning models and implementations.

huggingface/transformers • 4 papers • 124,984 stars

salesforce/lavis • 4 papers • 8,724 stars

ofa-sys/ofa • 3 papers • 2,324 stars

google-research/big_vision • 3 papers • 1,554 stars

See all 8 libraries.

Datasets

Subtasks

Semi Supervised Learning for Image Captioning

Aesthetic Image Captioning

Vietnamese Image Captioning

Hindi Image Captioning

Most implemented papers


Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning • 10 Feb 2015

Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images.


Show and Tell: A Neural Image Caption Generator

karpathy/neuraltalk • CVPR 2015

Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions.


Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

peteanderson80/bottom-up-attention • CVPR 2018

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.


Self-critical Sequence Training for Image Captioning

ruotianluo/ImageCaptioning.pytorch • CVPR 2017

In this paper, we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized.


CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

clovaai/CutMix-PyTorch • ICCV 2019

Regional dropout strategies have been proposed to enhance the performance of convolutional neural network classifiers.


Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

ashwinkalyan/dbs • 7 Oct 2016

We observe that our method consistently outperforms beam search (BS) and previously proposed techniques for diverse decoding from neural sequence models.


CIDEr: Consensus-based Image Description Evaluation

tylin/coco-caption • CVPR 2015

We propose a novel paradigm for evaluating image descriptions that uses human consensus.


Recurrent Neural Network Regularization

wojzaremba/lstm • 8 Sep 2014

We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units.


VQA: Visual Question Answering

ramprs/grad-cam • ICCV 2015

Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.


Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge

tensorflow/models • 21 Sep 2016

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.

