Visual Question Answering

662 papers with code • 18 benchmarks • 19 datasets

Most implemented papers

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

ramprs/grad-cam ICCV 2017

For captioning and VQA, we show that even non-attention based models can localize inputs.

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

peteanderson80/bottom-up-attention CVPR 2018

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.

VQA: Visual Question Answering

ramprs/grad-cam ICCV 2015

Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

A simple neural network module for relational reasoning

kimhc6028/relational-networks NeurIPS 2017

Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn.

Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering

Cyanogenoid/pytorch-vqa 11 Apr 2017

This paper presents a new baseline for the visual question answering task.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

salesforce/lavis 30 Jan 2023

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.

Dynamic Memory Networks for Visual and Textual Question Answering

therne/dmn-tensorflow 4 Mar 2016

Neural network architectures with memory and attention mechanisms exhibit certain reasoning capabilities required for question answering.

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

facebookresearch/vilbert-multi-task NeurIPS 2019

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

akirafukui/vqa-mcb EMNLP 2016

Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
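
The three baseline pooling operations named above can be illustrated with a minimal sketch. It assumes the visual and textual representations are already available as plain feature vectors. The function names and toy values are hypothetical, chosen only for illustration, and the sketch does not implement the compact bilinear pooling that the paper's title refers to.

```python
from typing import List

Vector = List[float]


def _check_same_length(visual: Vector, textual: Vector) -> None:
    # Element-wise operations need both vectors in the same dimension.
    if len(visual) != len(textual):
        raise ValueError("element-wise pooling needs equal-length vectors")


def elementwise_product(visual: Vector, textual: Vector) -> Vector:
    """Fuse two feature vectors by multiplying matching components."""
    _check_same_length(visual, textual)
    return [v * t for v, t in zip(visual, textual)]


def elementwise_sum(visual: Vector, textual: Vector) -> Vector:
    """Fuse two feature vectors by adding matching components."""
    _check_same_length(visual, textual)
    return [v + t for v, t in zip(visual, textual)]


def concatenate(visual: Vector, textual: Vector) -> Vector:
    """Fuse two feature vectors by placing one after the other."""
    return list(visual) + list(textual)


# Toy features (assumed values, not taken from any model).
visual_features = [1.0, 2.0, 3.0]
textual_features = [0.5, -1.0, 2.0]

print(elementwise_product(visual_features, textual_features))  # [0.5, -2.0, 6.0]
print(elementwise_sum(visual_features, textual_features))      # [1.5, 1.0, 5.0]
print(concatenate(visual_features, textual_features))          # 6 components
```

Product and sum keep the output at the input dimension but require both modalities to share that dimension, while concatenation accepts different sizes at the cost of a longer fused vector.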

Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge

peteanderson80/bottom-up-attention CVPR 2018

This paper presents a state-of-the-art model for visual question answering (VQA), which won first place in the 2017 VQA Challenge.