Visual Question Answering
662 papers with code • 18 benchmarks • 19 datasets
Libraries
Use these libraries to find Visual Question Answering models and implementations.
Datasets
Most implemented papers
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
For captioning and VQA, we show that even non-attention-based models can localize inputs.
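The core Grad-CAM computation can be sketched in a few lines: global-average-pool the gradients of the class score over each channel to get channel weights, take the weighted sum of the feature maps, and apply a ReLU. A minimal NumPy sketch of this idea (the array shapes and random inputs here are illustrative, not the paper's implementation):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from one conv layer.

    feature_maps: (C, H, W) activations of the chosen layer
    gradients:    (C, H, W) d(class score)/d(activations)
    """
    # Channel importance weights: global-average-pool the gradients.
    weights = gradients.mean(axis=(1, 2))              # (C,)
    # Weighted combination of feature maps over channels.
    cam = np.tensordot(weights, feature_maps, axes=1)  # (H, W)
    # ReLU keeps only locations that positively influence the class.
    return np.maximum(cam, 0.0)

# Toy example with random activations and gradients.
rng = np.random.default_rng(0)
activations = rng.standard_normal((8, 7, 7))
grads = rng.standard_normal((8, 7, 7))
heatmap = grad_cam(activations, grads)
print(heatmap.shape)  # (7, 7)
```

In practice the activations and gradients come from a forward/backward pass through a trained CNN (e.g. via framework hooks); the resulting heatmap is upsampled to the input resolution for visualization.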
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.
VQA: Visual Question Answering
Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
A simple neural network module for relational reasoning
Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn.
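The relation network module from that paper scores every ordered pair of objects with a shared function g, sums the pairwise outputs, and passes the result through a function f: RN(O) = f(Σ_{i,j} g(o_i, o_j)). A toy sketch of that structure, with placeholder linear maps standing in for the small MLPs the paper uses for g and f:

```python
import numpy as np

def relation_network(objects, g, f):
    """objects: (N, D) array of object embeddings.

    Applies g to every ordered pair of objects, sums the
    pairwise relation vectors, and maps the pooled vector
    through f: RN(O) = f( sum_{i,j} g(o_i, o_j) ).
    """
    n = objects.shape[0]
    pair_sum = sum(g(objects[i], objects[j])
                   for i in range(n) for j in range(n))
    return f(pair_sum)

# Placeholder g: concatenate the pair, apply a fixed linear map.
# Placeholder f: identity (real models learn MLPs for both).
rng = np.random.default_rng(1)
W = rng.standard_normal((4, 8))
g = lambda a, b: W @ np.concatenate([a, b])
f = lambda x: x

objects = rng.standard_normal((5, 4))  # 5 objects, 4-dim embeddings
out = relation_network(objects, g, f)
print(out.shape)  # (4,)
```

For VQA, the "objects" are typically cells of a CNN feature map, and the question embedding is concatenated into each pair before g is applied.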
Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering
This paper presents a new baseline for the visual question answering task.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
Dynamic Memory Networks for Visual and Textual Question Answering
Neural network architectures with memory and attention mechanisms exhibit certain reasoning capabilities required for question answering.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
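The simple pooling operations named above are easy to write down; full bilinear pooling corresponds to a flattened outer product, which the paper's compact bilinear method approximates without materializing it. A minimal NumPy sketch with toy feature vectors (the values are illustrative only):

```python
import numpy as np

v = np.array([0.2, 0.5, 0.1])  # toy visual feature
q = np.array([0.4, 0.3, 0.9])  # toy question feature

# Simple fusion baselines mentioned in the paper:
elementwise_product = v * q              # [0.08, 0.15, 0.09]
elementwise_sum = v + q                  # [0.6, 0.8, 1.0]
concat = np.concatenate([v, q])          # shape (6,)

# Full bilinear pooling: flattened outer product of the two
# features. Its dimension grows as |v| * |q|, which is what
# multimodal *compact* bilinear pooling approximates cheaply.
bilinear = np.outer(v, q).ravel()        # shape (9,)
print(bilinear.shape)
```

The outer product's quadratic size is the motivation for the compact approximation: with real feature dimensions in the thousands, the exact bilinear vector would have millions of entries.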
Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge
This paper presents a state-of-the-art model for visual question answering (VQA), which won first place in the 2017 VQA Challenge.