Visual Commonsense Reasoning
29 papers with code • 7 benchmarks • 7 datasets
Most implemented papers
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
UNITER: UNiversal Image-TExt Representation Learning
Unlike previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text).
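To make the conditional-masking idea concrete, here is a minimal sketch (not the UNITER authors' code; the masking rate, mask token id, and tensor shapes are assumptions): exactly one modality is masked per example while the other stays fully observed.

```python
import torch

MASK_ID = 103      # assumed [MASK] token id
MASK_PROB = 0.15   # BERT-style masking rate (assumption)

def conditional_mask(text_ids, region_feats, mask_text):
    """Mask exactly one modality per example; the other stays fully observed."""
    text_ids = text_ids.clone()
    region_feats = region_feats.clone()
    if mask_text:
        # Masked language modeling conditioned on all image regions.
        mask = torch.rand(text_ids.shape) < MASK_PROB
        text_ids[mask] = MASK_ID
    else:
        # Masked region modeling conditioned on the full sentence:
        # zero out the features of the selected regions.
        mask = torch.rand(region_feats.shape[:-1]) < MASK_PROB
        region_feats[mask] = 0.0
    return text_ids, region_feats

# Toy usage: alternate which modality is masked across training steps.
ids = torch.randint(0, 30000, (2, 16))   # token ids
feats = torch.randn(2, 36, 2048)         # detected-region features
masked_ids, _ = conditional_mask(ids, feats, mask_text=True)
```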
From Recognition to Cognition: Visual Commonsense Reasoning
While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
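A minimal sketch of adversarial training in embedding space, in the spirit of VILLA (the `model(txt_emb, img_emb)` interface and the hyperparameters are assumptions, and the full method also perturbs the image-side embeddings): a small perturbation is found by ascending the loss, then the model is trained on the perturbed input.

```python
import torch

def adversarial_loss(model, txt_emb, img_emb, labels, loss_fn,
                     eps=1e-3, adv_lr=1e-2):
    """One-step adversarial perturbation of the text embeddings:
    find a small delta that increases the loss, then train on it."""
    delta = torch.zeros_like(txt_emb).uniform_(-eps, eps).requires_grad_(True)
    loss = loss_fn(model(txt_emb + delta, img_emb), labels)
    grad, = torch.autograd.grad(loss, delta)
    # Ascend the loss surface, then project back into the eps-ball.
    delta = (delta + adv_lr * grad.sign()).clamp(-eps, eps).detach()
    return loss_fn(model(txt_emb + delta, img_emb), labels)
```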
Unifying Vision-and-Language Tasks via Text Generation
On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning (most of which have previously been modeled as discriminative tasks), our generative approach with a single unified architecture reaches performance comparable to recent task-specific state-of-the-art vision-and-language models.
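A hedged sketch of how such tasks can be cast as text generation (the task prefixes and field names below are illustrative assumptions, not the paper's exact formats): every task instance becomes a (source text, target text) pair for one generative model.

```python
def to_text_pair(example, task):
    """Cast a task instance into (source_text, target_text) for a single
    generative model; prefixes and dict keys are illustrative assumptions."""
    if task == "vqa":
        return f"vqa: question: {example['question']}", example["answer"]
    if task == "grounding":
        return f"visual grounding: {example['phrase']}", example["region_id"]
    if task == "vcr":
        src = (f"vcr: question: {example['question']} "
               f"choices: {' | '.join(example['choices'])}")
        return src, example["choices"][example["label"]]
    raise ValueError(f"unknown task: {task}")

# Example: a VCR instance becomes plain text generation.
pair = to_text_pair(
    {"question": "Why is [person1] pointing?",
     "choices": ["to give directions", "to wave", "to order food", "to stretch"],
     "label": 0},
    task="vcr")
```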
X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics
Nevertheless, there has not been an open-source codebase that supports training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Before being sent to the LLM, the region reference is replaced by RoI features and interleaved with the language embeddings as a single sequence.
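A rough sketch of that interleaving step (the function and variable names are hypothetical; in practice the RoI features would come from a region feature extractor projected to the LLM's embedding width): each region-placeholder token's embedding is swapped for an RoI feature.

```python
import torch

def interleave_roi(token_embs, token_ids, roi_feats, region_token_id):
    """Swap each region-placeholder embedding for an RoI feature so the LLM
    consumes language and region embeddings as one sequence."""
    out = token_embs.clone()
    positions = (token_ids == region_token_id).nonzero(as_tuple=True)[0]
    assert positions.numel() == roi_feats.shape[0], "one RoI feature per placeholder"
    out[positions] = roi_feats   # roi_feats already projected to the LLM width
    return out

# Toy prompt of 10 tokens with two <region> placeholders (id 42).
embs = torch.randn(10, 4096)
ids = torch.tensor([1, 5, 42, 7, 8, 42, 9, 2, 3, 4])
rois = torch.randn(2, 4096)
seq = interleave_roi(embs, ids, rois, region_token_id=42)
```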
Think Visually: Question Answering through Virtual Imagery
In this paper, we study the problem of geometric reasoning in the context of question-answering.
Fusion of Detected Objects in Text for Visual Question Answering
To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language.