Text-to-Image Generation
276 papers with code • 11 benchmarks • 18 datasets
Text-to-Image Generation is a task at the intersection of computer vision and natural language processing: given a textual description, generate an image that depicts it. This typically involves encoding the text into a meaningful representation, such as an embedding vector, and then conditioning an image generator on that representation so the output matches the description.
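As a minimal illustration of this pipeline, the sketch below runs a pretrained text-to-image model with the Hugging Face diffusers library; the checkpoint name is one public example and an illustrative choice, not the only option.

    # Minimal text-to-image sketch with diffusers (assumes a CUDA GPU).
    import torch
    from diffusers import StableDiffusionPipeline

    # Load a pretrained pipeline: text encoder + diffusion model + image decoder.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # one public checkpoint; swap as needed
        torch_dtype=torch.float16,
    ).to("cuda")

    # The prompt is encoded into a text embedding that conditions generation;
    # the result is an image intended to match the description.
    image = pipe("a red bicycle leaning against a brick wall").images[0]
    image.save("bicycle.png")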
Most implemented papers
Show and Tell: A Neural Image Caption Generator
Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions.
Generative Adversarial Text to Image Synthesis
Automatic synthesis of realistic images from text would be interesting and useful, but current AI systems are still far from this goal.
High-Resolution Image Synthesis with Latent Diffusion Models
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond.
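To make the idea of sequential denoising concrete, here is a deliberately simplified sampling loop; the denoiser below is a hypothetical stand-in for a trained noise-prediction network, and the update rule is a didactic simplification of real samplers such as DDPM or DDIM, not the paper's actual algorithm.

    # Schematic diffusion sampling: repeatedly remove predicted noise.
    import torch

    def denoiser(x, t):
        # Stand-in for a trained noise-prediction network eps_theta(x_t, t).
        return torch.zeros_like(x)

    num_steps = 50
    x = torch.randn(1, 4, 64, 64)       # start from Gaussian noise in latent space
    for t in reversed(range(num_steps)):
        eps = denoiser(x, t)            # predict the noise present at step t
        x = x - eps / num_steps         # strip away a fraction of that noise
    # In latent diffusion, x would then be decoded to pixels by the
    # autoencoder's decoder rather than being an image itself.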
StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks
Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications.
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation.
StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks
In this paper, we propose Stacked Generative Adversarial Networks (StackGANs) aimed at generating high-resolution photo-realistic images.
Taming Transformers for High-Resolution Image Synthesis
We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.
Zero-Shot Text-to-Image Generation
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset.
Hierarchical Text-Conditional Image Generation with CLIP Latents
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style.
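As a brief illustration of the shared text-image embedding space such systems build on, the sketch below scores a caption against an image using the Hugging Face transformers implementation of CLIP; the checkpoint name and image file are illustrative examples.

    # Score a caption against an image in CLIP's joint embedding space.
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("bicycle.png")   # any local image
    inputs = processor(text=["a red bicycle"], images=image,
                       return_tensors="pt", padding=True)
    outputs = model(**inputs)

    # Text-image similarity scores; generative systems like DALL-E 2
    # learn to invert such embeddings back into images.
    print(outputs.logits_per_image)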
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
It remains unclear, however, how the creative freedom of text-guided generation can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes.
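As a usage sketch, the diffusers library exposes load_textual_inversion for plugging a learned concept embedding into a pipeline; the embedding file path and the placeholder token below are hypothetical examples, not artifacts from the paper.

    # Use a learned textual-inversion embedding in a prompt.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Load an embedding trained for a specific concept; "<my-concept>" is a
    # hypothetical pseudo-word bound to the new embedding.
    pipe.load_textual_inversion("./my_concept_embedding.bin", token="<my-concept>")

    # The pseudo-word composes with ordinary language in prompts.
    image = pipe("a painting of <my-concept> in the style of Monet").images[0]
    image.save("concept.png")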