Visual Dialog
54 papers with code • 8 benchmarks • 10 datasets
Visual Dialog requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the task is to answer the question.
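The inputs named in the task description (an image, a dialog history, and a follow-up question) can be sketched as a small data structure. This is an illustrative sketch only: `VisualDialogTurn`, `flatten_context`, the `image_id` field, and the `[SEP]` separator are hypothetical names chosen for this example, not the schema of any particular dataset or library.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VisualDialogTurn:
    """One query to a Visual Dialog agent (hypothetical structure)."""
    image_id: str  # assumed reference to the image being discussed
    history: List[Tuple[str, str]] = field(default_factory=list)  # earlier (question, answer) pairs
    question: str = ""  # the follow-up question to be answered


def flatten_context(turn: VisualDialogTurn, sep: str = " [SEP] ") -> str:
    """Join the dialog history and the current question into one text string."""
    parts = [f"Q: {q} A: {a}" for q, a in turn.history]
    parts.append(f"Q: {turn.question}")
    return sep.join(parts)


turn = VisualDialogTurn(
    image_id="img_001",
    history=[("what color is the cat?", "black")],
    question="is it sleeping?",
)
print(flatten_context(turn))
```

A model would consume the image referenced by `image_id` together with this text context and produce (or rank) an answer to the final question.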
Most implemented papers
Visual Dialog
We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content.
Hierarchical Question-Image Co-Attention for Visual Question Answering
In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images.
Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7
Scene-aware dialog systems will be able to have conversations with users about the objects and events around them.
Visual Dialogue without Vision or Dialogue
We characterise some of the quirks and shortcomings in the exploration of Visual Dialogue - a sequential question-answering task where the questions and corresponding answers are related through given visual stimuli.
Dual Attention Networks for Visual Reference Resolution in Visual Dialog
Specifically, the REFER module learns latent relationships between a given question and a dialog history by employing a self-attention mechanism.
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
Next, we find that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG -- more than 10% over our base model -- but hurts MRR -- more than 17% below our base model!
History for Visual Dialog: Do we really need it?
Visual Dialog involves "understanding" the dialog history (what has been discussed previously) and the current question (what is asked), in addition to grounding information in the image, to generate the correct response.
Where Are You? Localization from Embodied Dialog
In this paper, we focus on the LED task -- providing a strong baseline model with detailed ablations characterizing both dataset biases and the importance of various modeling choices.
The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training
As a result, GST scales the amount of training data up to an order of magnitude that of VisDial (1.2M to 12.9M QA data).
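The large-scale pretraining entry above reports NDCG rising while MRR falls. One way the two metrics can diverge is sketched below in generic form: MRR rewards only the rank of the single ground-truth answer, while NDCG credits every candidate that carries a relevance score. The function names, the cutoff `k`, and the toy relevance values are assumptions made for illustration, and the official VisDial evaluation may differ in its details.

```python
import math


def reciprocal_rank(ranked_ids, gt_id):
    """1 / (1-based position of the ground-truth answer in the ranking)."""
    return 1.0 / (ranked_ids.index(gt_id) + 1)


def mean_reciprocal_rank(rankings, gt_ids):
    """Average reciprocal rank over several questions."""
    return sum(reciprocal_rank(r, g) for r, g in zip(rankings, gt_ids)) / len(rankings)


def dcg(relevances):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the top-k ranked candidates divided by the best achievable DCG."""
    gains = [relevance[i] for i in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy case: candidates 0 and 1 are both fully relevant, but only 0 is the
# single annotated ground-truth answer.
relevance = {0: 1.0, 1: 1.0, 2: 0.0, 3: 0.0}
ranking = [1, 0, 2, 3]  # a relevant non-ground-truth answer is ranked first

print(ndcg_at_k(ranking, relevance, k=2))      # both relevant answers are in the top 2
print(mean_reciprocal_rank([ranking], [0]))    # ground truth sits at rank 2
```

In this toy case the ranking is ideal under NDCG yet earns a reciprocal rank of only 0.5, because the ground-truth answer sits second.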