Open Vocabulary Attribute Detection
11 papers with code • 2 benchmarks • 1 dataset
Open-Vocabulary Attribute Detection (OVAD) is a task that aims to detect and recognize an open set of objects and their associated attributes in an image. Both the objects and the attributes are specified by text queries at inference time; the classes evaluated at test time are not known during training.
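The core mechanic shared by most methods below is matching a region's image embedding against text-query embeddings in a joint vision-language space. The following is a minimal, self-contained sketch of that scoring step; the embeddings, attribute names, and threshold are toy placeholders standing in for the outputs of a real encoder such as CLIP, not the OVAD benchmark's actual code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def detect_attributes(region_embedding, attribute_queries, threshold=0.5):
    """Score each attribute text query against one detected region's
    image embedding; keep attributes whose similarity clears a threshold.
    attribute_queries: list of (attribute_name, text_embedding) pairs,
    where the text embedding would come from a language encoder."""
    scored = [(name, cosine(region_embedding, emb))
              for name, emb in attribute_queries]
    return [(name, s) for name, s in scored if s > threshold]

# Toy 3-d embeddings standing in for encoder outputs.
region = [1.0, 0.0, 0.0]
queries = [("red",     [1.0, 0.0, 0.0]),   # aligned with the region
           ("striped", [0.0, 1.0, 0.0]),   # orthogonal to the region
           ("shiny",   [0.6, 0.8, 0.0])]
print(detect_attributes(region, queries))  # → [('red', 1.0), ('shiny', 0.6)]
```

Because the attribute vocabulary is just a list of text queries, new attributes can be added at inference time without retraining, which is what makes the setting "open-vocabulary".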
Libraries
Use these libraries to find Open Vocabulary Attribute Detection models and implementations
Most implemented papers
Learning Transferable Visual Models From Natural Language Supervision
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision.
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.
Reproducible scaling laws for contrastive language-image learning
We investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository.
Open-Vocabulary Object Detection Using Captions
Weakly supervised and zero-shot learning techniques have been explored to scale object detectors to more categories with less supervision, but they have not been as successful and widely adopted as supervised models.
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
Most existing methods in vision language pre-training rely on object-centric features extracted through object detection and make fine-grained alignments between the extracted features and texts.
Localized Vision-Language Matching for Open-vocabulary Object Detection
In this work, we propose an open-vocabulary object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes.
Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection
Two popular forms of weak-supervision used in open-vocabulary detection (OVD) include pretrained CLIP model and image-level supervision.
Open-vocabulary Attribute Detection
The objective of the novel task and benchmark is to probe object-level attribute information learned by vision-language models.