Zero-Shot Transfer Image Classification
16 papers with code • 16 benchmarks • 8 datasets
Most implemented papers
Learning Transferable Visual Models From Natural Language Supervision
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.
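In the zero-shot transfer setting, CLIP scores an image against a set of natural-language class prompts and predicts the highest-scoring one. A minimal sketch using the Hugging Face transformers implementation; the checkpoint name, image path, and prompts below are just examples:
```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Example checkpoint; any CLIP checkpoint with an image and a text tower works the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # image-to-text similarities as probabilities
print(prompts[probs.argmax().item()])              # predicted class prompt
```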
CoCa: Contrastive Captioners are Image-Text Foundation Models
We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the outputs of the multimodal decoder, which predicts text tokens autoregressively.
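A minimal PyTorch sketch of that combined objective; the loss weights and temperature here are illustrative placeholders, not the paper's settings:
```python
import torch
import torch.nn.functional as F

def coca_loss(image_emb, text_emb, caption_logits, caption_tokens,
              temperature=0.07, contrastive_weight=1.0, caption_weight=2.0):
    """Contrastive loss on unimodal embeddings plus captioning loss on decoder outputs."""
    # Contrastive part: symmetric InfoNCE over normalized image/text embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2
    # Captioning part: autoregressive cross-entropy on the multimodal decoder logits.
    caption = F.cross_entropy(caption_logits.flatten(0, 1), caption_tokens.flatten())
    return contrastive_weight * contrastive + caption_weight * caption
```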
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used in the Conceptual Captions dataset.
LiT: Zero-Shot Transfer with Locked-image text Tuning
This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training.
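The key step is freezing ("locking") the pretrained image tower so that contrastive training updates only the text tower. A minimal PyTorch sketch, assuming generic `image_encoder` / `text_encoder` modules:
```python
import torch

def lock_image_tower(image_encoder, text_encoder, lr=1e-4):
    """Set up locked-image contrastive tuning: only the text tower is trained.
    `image_encoder` / `text_encoder` are assumed to be pretrained torch.nn.Module instances."""
    for p in image_encoder.parameters():
        p.requires_grad = False          # lock the pretrained image tower
    image_encoder.eval()                 # keep frozen normalization/dropout behavior
    # The optimizer only sees the text tower (in practice, a learned temperature too).
    return torch.optim.AdamW(text_encoder.parameters(), lr=lr)
```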
EVA-CLIP: Improved Training Techniques for CLIP at Scale
Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs.
Your Diffusion Model is Secretly a Zero-Shot Classifier
Our generative approach to classification, which we call Diffusion Classifier, attains strong results on a variety of benchmarks and outperforms alternative methods of extracting knowledge from diffusion models.
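The core idea is to score each candidate class by how well the text-conditional diffusion model predicts the noise added to the image, then pick the class with the lowest denoising error. A rough sketch assuming a diffusers-style UNet, scheduler, and precomputed per-class text embeddings:
```python
import torch

@torch.no_grad()
def diffusion_classify(unet, scheduler, latents, class_embeds, n_trials=32):
    """Return the index of the class whose conditioning best predicts the added noise."""
    errors = []
    for cond in class_embeds:                      # one text embedding per candidate class
        err = 0.0
        for _ in range(n_trials):                  # Monte Carlo over timesteps and noise
            t = torch.randint(0, scheduler.config.num_train_timesteps, (1,),
                              device=latents.device)
            noise = torch.randn_like(latents)
            noisy = scheduler.add_noise(latents, noise, t)
            pred = unet(noisy, t, encoder_hidden_states=cond).sample
            err += torch.mean((pred - noise) ** 2).item()
        errors.append(err / n_trials)
    return int(torch.tensor(errors).argmin())      # lowest denoising error wins
```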
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models.
Florence: A New Foundation Model for Computer Vision
Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for solving real-world computer vision applications.
PaLI: A Jointly-Scaled Multilingual Language-Image Model
PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks in many languages.