Visual Reasoning
211 papers with code • 12 benchmarks • 41 datasets
The ability to understand and reason about the actions and relationships depicted in visual images.
Most implemented papers
Learning Transferable Visual Models From Natural Language Supervision
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.
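CLIP instead learns image representations directly from natural-language supervision, which enables zero-shot classification from text prompts. A minimal sketch, assuming the public openai/clip-vit-base-patch32 checkpoint on the Hugging Face Hub and a hypothetical local image file:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot classification: score an image against free-form text prompts.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image   # image-text similarity scores
print(logits.softmax(dim=-1))               # probabilities over the label prompts
```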
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
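BLIP-2 keeps the image encoder and the language model frozen and trains only a lightweight bridging module, so a released checkpoint can be queried directly. A minimal sketch, assuming the Salesforce/blip2-opt-2.7b checkpoint and a hypothetical local image:

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Visual question answering with a released BLIP-2 checkpoint (OPT-2.7B language model).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")  # hypothetical local image
prompt = "Question: how many people are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```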
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
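ViLBERT processes image regions and text in two separate streams that exchange information through co-attentional Transformer layers. A simplified PyTorch sketch of the co-attention idea (not the authors' implementation; dimensions and names are illustrative):

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    # Each stream attends to the other stream's keys/values (ViLBERT-style co-attention).
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):
        txt_out, _ = self.txt_attn(txt, img, img)   # text queries attend to image regions
        img_out, _ = self.img_attn(img, txt, txt)   # image queries attend to text tokens
        return txt + txt_out, img + img_out
```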
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval.
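The key ingredient of VSE++ is a triplet ranking loss that keeps only the hardest negative in each mini-batch. A minimal PyTorch sketch of that max-of-hinges loss, assuming L2-normalized embeddings where matching image/caption pairs share an index:

```python
import torch

def vse_plus_plus_loss(im, s, margin=0.2):
    # im, s: normalized image and caption embeddings, shape (batch, dim).
    scores = im @ s.t()                                   # cosine similarity matrix
    pos = scores.diag().view(-1, 1)                       # scores of matching pairs
    cost_s = (margin + scores - pos).clamp(min=0)         # caption-retrieval violations
    cost_im = (margin + scores - pos.t()).clamp(min=0)    # image-retrieval violations
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)
    # Max of hinges: only the hardest negative in the batch contributes (the VSE++ idea).
    return cost_s.max(dim=1)[0].mean() + cost_im.max(dim=0)[0].mean()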
Compositional Attention Networks for Machine Reasoning
We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning.
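Each MAC cell decomposes a reasoning step into a control unit (attends over the question), a read unit (attends over the image knowledge base), and a write unit (updates memory). A heavily simplified sketch of one cell, omitting gating and self-attention from the paper; all layer names and shapes are illustrative:

```python
import torch
import torch.nn as nn

class MACCell(nn.Module):
    # One reasoning step: control attends over question words, read attends over
    # the knowledge base, write integrates the retrieved information into memory.
    def __init__(self, dim):
        super().__init__()
        self.control_q = nn.Linear(2 * dim, dim)
        self.control_attn = nn.Linear(dim, 1)
        self.read_mem = nn.Linear(dim, dim)
        self.read_kb = nn.Linear(dim, dim)
        self.read_attn = nn.Linear(dim, 1)
        self.write = nn.Linear(2 * dim, dim)

    def forward(self, control, memory, question, context, kb):
        # control/memory/question: (B, d); context: (B, T, d); kb: (B, N, d)
        cq = self.control_q(torch.cat([control, question], dim=-1))
        ca = torch.softmax(self.control_attn(cq.unsqueeze(1) * context), dim=1)
        new_control = (ca * context).sum(1)                      # new control state
        interaction = self.read_mem(memory).unsqueeze(1) * self.read_kb(kb)
        ra = torch.softmax(self.read_attn(new_control.unsqueeze(1) * interaction), dim=1)
        retrieved = (ra * kb).sum(1)                             # retrieved knowledge
        new_memory = self.write(torch.cat([retrieved, memory], dim=-1))
        return new_control, new_memory
```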
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
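LXMERT consumes text tokens alongside detector region features and their box coordinates. A minimal sketch using the unc-nlp/lxmert-base-uncased checkpoint on the Hugging Face Hub, with random tensors standing in for Faster R-CNN features:

```python
import torch
from transformers import LxmertModel, LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

inputs = tokenizer("Is the dog to the left of the tree?", return_tensors="pt")
visual_feats = torch.randn(1, 36, 2048)   # placeholder detector region features
visual_pos = torch.rand(1, 36, 4)         # placeholder normalized box coordinates
outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
# outputs.language_output, outputs.vision_output, outputs.pooled_output
```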
Visual Instruction Tuning
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.
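The resulting LLaVA models follow free-form visual instructions. A minimal inference sketch, assuming a recent transformers release, the community llava-hf/llava-1.5-7b-hf checkpoint, its prompt template, and a hypothetical local image:

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

image = Image.open("example.jpg")  # hypothetical local image
prompt = "USER: <image>\nWhat objects are on the table, and how many? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```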
VisualBERT: A Simple and Performant Baseline for Vision and Language
We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks.
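VisualBERT feeds text tokens and detector region features into a single BERT-style Transformer. A minimal sketch with the uclanlp/visualbert-vqa-coco-pre checkpoint, using random tensors in place of real detector features:

```python
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

inputs = tokenizer("A person rides a horse on the beach.", return_tensors="pt")
visual_embeds = torch.randn(1, 36, 2048)  # placeholder detector region features
inputs.update({
    "visual_embeds": visual_embeds,
    "visual_token_type_ids": torch.ones(visual_embeds.shape[:-1], dtype=torch.long),
    "visual_attention_mask": torch.ones(visual_embeds.shape[:-1], dtype=torch.float),
})
outputs = model(**inputs)
# outputs.last_hidden_state covers the text tokens followed by the visual tokens
```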
UNITER: UNiversal Image-TExt Representation Learning
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
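A minimal sketch of the conditional-masking idea (not UNITER's actual pre-processing code): each training example corrupts only one modality, so the masked modality is always predicted from a fully observed partner modality.

```python
import random
import torch

def conditional_mask(text_ids, region_feats, mask_token_id, p=0.15):
    # Corrupt either the text tokens or the image regions, never both at once.
    text_ids = text_ids.clone()
    region_feats = region_feats.clone()
    if random.random() < 0.5:
        mask = torch.rand(text_ids.shape) < p            # masked language modeling
        text_ids[mask] = mask_token_id
    else:
        mask = torch.rand(region_feats.shape[:-1]) < p   # masked region modeling
        region_feats[mask] = 0.0
    return text_ids, region_feats
```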
VinVL: Revisiting Visual Representations in Vision-Language Models
In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR (Li et al., 2020), and utilize an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
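In OSCAR-style fusion, word tokens, detected object tags, and projected region features are concatenated into one input sequence for the Transformer. A minimal sketch of that input construction; the function and argument names are hypothetical, not from the VinVL codebase:

```python
import torch

def build_vinvl_style_input(text_emb, tag_emb, region_feats, proj):
    # text_emb: (B, T, d), tag_emb: (B, K, d), region_feats: (B, N, feat_dim)
    vis = proj(region_feats)                     # project detector features to hidden size d
    return torch.cat([text_emb, tag_emb, vis], dim=1)  # one sequence for the fusion model
```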