Visual Question Answering (VQA)
763 papers with code • 62 benchmarks • 112 datasets
Visual Question Answering (VQA) is a multimodal task at the intersection of computer vision and natural language processing: given an image and a natural-language question about it, a model must produce an accurate natural-language answer. The goal of VQA is to teach machines to jointly understand the content of an image and the meaning of a question posed about it.
Image Source: visualqa.org
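To make the task concrete, here is a minimal sketch of running VQA inference, assuming the Hugging Face `transformers` library; the checkpoint name and image path are illustrative assumptions, not part of the original page.

```python
# Minimal VQA sketch: image + natural-language question -> ranked answers.
# Assumes the Hugging Face `transformers` library; the checkpoint name is an
# illustrative assumption.
from transformers import pipeline
from PIL import Image

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

image = Image.open("example.jpg")          # any RGB image
question = "How many people are in the photo?"

answers = vqa(image=image, question=question, top_k=3)
for a in answers:
    print(f"{a['answer']}: {a['score']:.3f}")
```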
Libraries
Use these libraries to find Visual Question Answering (VQA) models and implementations.
Datasets
Subtasks
Most implemented papers
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
For image captioning and VQA, the authors show that even models without explicit attention mechanisms can localize the image regions that support their outputs.
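A rough sketch of the Grad-CAM idea, assuming PyTorch and a torchvision classifier: the last convolutional feature maps are weighted by the spatially averaged gradient of the target score, and the ReLU of the weighted sum gives a coarse localization map. Layer choice and input are placeholders.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
activations, gradients = {}, {}

def fwd_hook(_, __, output):
    activations["maps"] = output

def bwd_hook(_, grad_in, grad_out):
    gradients["maps"] = grad_out[0]

layer = model.layer4[-1]                       # last convolutional block
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)                # placeholder input image tensor
scores = model(x)
scores[0, scores.argmax()].backward()          # gradient of the top class score

weights = gradients["maps"].mean(dim=(2, 3), keepdim=True)    # GAP over H, W
cam = F.relu((weights * activations["maps"]).sum(dim=1))      # coarse heatmap
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224), mode="bilinear")
```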
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.
ParlAI: A Dialog Research Software Platform
We introduce ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, available at http://parl.ai.
VQA: Visual Question Answering
Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
A simple neural network module for relational reasoning
Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn.
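The sketch below illustrates the Relation Network idea under stated assumptions (PyTorch; layer sizes are arbitrary): every pair of object features is scored jointly with the question embedding by a small MLP g, and the summed pair representations are read out by a second MLP f to produce answer logits.

```python
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    def __init__(self, obj_dim, q_dim, hidden=256, n_answers=1000):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.f = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_answers),
        )

    def forward(self, objects, question):
        # objects: (B, N, obj_dim) e.g. CNN feature-map cells; question: (B, q_dim)
        B, N, D = objects.shape
        o_i = objects.unsqueeze(2).expand(B, N, N, D)      # first object in each pair
        o_j = objects.unsqueeze(1).expand(B, N, N, D)      # second object in each pair
        q = question.unsqueeze(1).unsqueeze(1).expand(B, N, N, question.size(-1))
        pairs = torch.cat([o_i, o_j, q], dim=-1)           # all N*N pairs, conditioned on q
        relations = self.g(pairs).sum(dim=(1, 2))          # aggregate over pairs
        return self.f(relations)                           # answer logits
```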
Stacked Attention Networks for Image Question Answering
Thus, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively.
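A hedged sketch of one stacked-attention hop in this spirit, assuming PyTorch and arbitrary feature sizes: the question vector attends over image regions, and the attended visual summary is added back to the query so a second hop can refine the evidence.

```python
import torch
import torch.nn as nn

class AttentionHop(nn.Module):
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.w_img = nn.Linear(dim, hidden)
        self.w_qry = nn.Linear(dim, hidden)
        self.w_att = nn.Linear(hidden, 1)

    def forward(self, regions, query):
        # regions: (B, R, dim) image region features; query: (B, dim)
        h = torch.tanh(self.w_img(regions) + self.w_qry(query).unsqueeze(1))
        attn = torch.softmax(self.w_att(h).squeeze(-1), dim=1)     # (B, R) attention weights
        summary = (attn.unsqueeze(-1) * regions).sum(dim=1)        # attended image vector
        return summary + query                                     # refined query for next hop

hop1, hop2 = AttentionHop(1024), AttentionHop(1024)
regions = torch.randn(2, 196, 1024)          # e.g. a 14x14 grid of CNN features
question = torch.randn(2, 1024)              # question encoding
u = hop2(regions, hop1(regions, question))   # query the image twice, progressively
```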
ECG Heartbeat Classification: A Deep Transferable Representation
Electrocardiogram (ECG) can be reliably used as a measure to monitor the functionality of the cardiovascular system.
Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering
This paper presents a new strong baseline for the visual question answering task.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
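For illustration, a BLIP-2 checkpoint can be queried for VQA through the Hugging Face `transformers` API as sketched below; the checkpoint name, prompt, and image path are assumptions, and a GPU with enough memory is assumed for the half-precision model.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

name = "Salesforce/blip2-opt-2.7b"           # assumed checkpoint
processor = Blip2Processor.from_pretrained(name)
model = Blip2ForConditionalGeneration.from_pretrained(
    name, torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")
prompt = "Question: what is the person holding? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```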
Dynamic Memory Networks for Visual and Textual Question Answering
Neural network architectures with memory and attention mechanisms exhibit certain reasoning capabilities required for question answering.