Image Captioning
615 papers with code • 32 benchmarks • 65 datasets
Image Captioning is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metrics.
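As a rough illustration of the encoder-decoder framework described above, the following Python sketch runs a greedy decoding loop. The `encode` and `next_token_scores` callables, the `<start>` and `<end>` tokens, and the toy stand-ins are hypothetical placeholders rather than part of any specific model or library; a real system would plug in a trained image encoder and a trained language decoder.

```python
from typing import Callable, Dict, List, Sequence

START, END = "<start>", "<end>"  # hypothetical special tokens


def greedy_caption(
    image: Sequence[float],
    encode: Callable[[Sequence[float]], List[float]],
    next_token_scores: Callable[[List[float], List[str]], Dict[str, float]],
    max_len: int = 20,
) -> List[str]:
    """Encode the image once, then decode one token at a time."""
    features = encode(image)  # intermediate representation of the image
    tokens = [START]
    for _ in range(max_len):
        scores = next_token_scores(features, tokens)
        best = max(scores, key=scores.get)  # greedy: take the top-scoring token
        if best == END:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start token


# Toy stand-ins so the sketch runs end to end (these are not trained models).
def toy_encode(image: Sequence[float]) -> List[float]:
    return [sum(image) / len(image)]  # a single "feature": the mean pixel value


def toy_scores(features: List[float], tokens: List[str]) -> Dict[str, float]:
    script = ["a", "bright", "image"] if features[0] > 0.5 else ["a", "dark", "image"]
    step = len(tokens) - 1  # number of words emitted so far
    target = script[step] if step < len(script) else END
    return {word: (1.0 if word == target else 0.0) for word in script + [END]}


caption = greedy_caption([0.9, 0.8, 0.7], toy_encode, toy_scores)
print(" ".join(caption))
```

Greedy decoding is used here only because it is the shortest loop to write; beam search or sampling would replace the `max` step.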
(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
Most implemented papers
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images.
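A minimal sketch of one soft attention step may help make the idea concrete: each image region gets a score, the scores are normalized with a softmax, and the context vector is the weighted sum of the region features. The dot-product scoring and the toy `regions` and `query` values are assumptions made here for brevity; the paper's own scoring function is learned, so this is not its exact formulation.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(
    regions: List[List[float]], query: List[float]
) -> Tuple[List[float], List[float]]:
    """Return (weights, context) for a single decoding step."""
    # Assumption: dot-product scoring, chosen for brevity (not the paper's learned scorer).
    scores = [sum(r * q for r, q in zip(region, query)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    # Context vector: attention-weighted sum of the region feature vectors.
    context = [
        sum(w * region[d] for w, region in zip(weights, regions)) for d in range(dim)
    ]
    return weights, context


# Toy example: three regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = soft_attention(regions, query=[2.0, 0.0])
print(weights, context)
```

In a captioning decoder, `query` would typically be derived from the previous hidden state, so the weights change at every generated word.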
Show and Tell: A Neural Image Caption Generator
Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions.
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.
Self-critical Sequence Training for Image Captioning
In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized.
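The reinforcement learning objective can be sketched roughly as follows, assuming (as the method is commonly described) that the reward of the model's own greedy caption serves as the baseline for a sampled caption. The function name, arguments, and numbers below are hypothetical; in real training the rewards would come from a metric such as CIDEr and the advantage would be treated as a constant during backpropagation.

```python
from typing import List


def scst_loss(
    sample_logprobs: List[float], sample_reward: float, greedy_reward: float
) -> float:
    """REINFORCE-style loss that uses the greedy caption's reward as baseline."""
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the log-probability of samples that beat the
    # greedy caption and lowers it for samples that score worse.
    return -advantage * sum(sample_logprobs)


# Hypothetical per-token log-probabilities of one sampled caption.
logprobs = [-0.5, -1.0, -0.7]

loss_better = scst_loss(logprobs, sample_reward=1.1, greedy_reward=0.9)
loss_worse = scst_loss(logprobs, sample_reward=0.6, greedy_reward=0.9)
print(loss_better, loss_worse)
```

When the sample and the greedy caption receive the same reward, the loss is zero and that sample contributes no gradient.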
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
Regional dropout strategies have been proposed to enhance the performance of convolutional neural network classifiers.
Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
We observe that our method consistently outperforms beam search (BS) and previously proposed techniques for diverse decoding from neural sequence models.
CIDEr: Consensus-based Image Description Evaluation
We propose a novel paradigm for evaluating image descriptions that uses human consensus.
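The consensus idea can be illustrated with a simplified CIDEr-style score: TF-IDF weighted n-gram vectors are built for the candidate and for each reference, and the cosine similarities are averaged over references and over n-gram orders. This is only a sketch under stated assumptions: whitespace tokenization, no stemming, none of the later CIDEr-D modifications, and a tiny made-up corpus. It should not be used to reproduce published numbers.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> List[NGram]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def doc_freq(corpus_refs: List[List[str]], n: int) -> Counter:
    """For each n-gram, count the images whose references contain it."""
    df: Counter = Counter()
    for refs in corpus_refs:
        seen = set()
        for ref in refs:
            seen.update(ngrams(ref.split(), n))
        df.update(seen)
    return df


def tfidf(tokens: List[str], n: int, df: Counter, num_images: int) -> Dict[NGram, float]:
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    # n-grams that are rare across images get a larger IDF weight.
    return {
        g: (c / total) * math.log(num_images / max(1, df.get(g, 0)))
        for g, c in counts.items()
    }


def cosine(u: Dict[NGram, float], v: Dict[NGram, float]) -> float:
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cider(candidate: str, refs: List[str], corpus_refs: List[List[str]], max_n: int = 4) -> float:
    num_images = len(corpus_refs)
    score = 0.0
    for n in range(1, max_n + 1):
        df = doc_freq(corpus_refs, n)
        cand = tfidf(candidate.split(), n, df, num_images)
        sims = [cosine(cand, tfidf(r.split(), n, df, num_images)) for r in refs]
        score += sum(sims) / len(sims) / max_n  # uniform weight per n-gram order
    return score


# Made-up reference captions for three images.
corpus = [
    ["a dog runs on the grass", "a brown dog is running"],
    ["a cat sits on a sofa", "a cat on the couch"],
    ["two people ride bikes", "people cycling on a road"],
]
good = cider("a dog runs on the grass", corpus[0], corpus)
bad = cider("two people ride bikes", corpus[0], corpus)
print(good, bad)
```

Words that appear in the references of every image, such as "a" in this toy corpus, receive an IDF of zero and so do not contribute to the score.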
Recurrent Neural Network Regularization
We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units.
VQA: Visual Question Answering
Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.