Hierarchical Text-Conditional Image Generation with CLIP Latents
Summary
OpenAI proposes a hierarchical two-stage model for text-conditional image generation using CLIP latents: a prior that generates CLIP image embeddings from text captions, and a diffusion-based decoder that generates images conditioned on those embeddings. The approach improves image diversity with minimal loss in photorealism and caption similarity, and enables zero-shot, language-guided image manipulations.
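A minimal structural sketch of the two-stage pipeline described above, written in PyTorch. The module definitions and dimensions are illustrative placeholders only; the paper's actual prior is a Transformer-based diffusion model and its decoder is a diffusion model with upsamplers.

```python
# Sketch of the two-stage text -> CLIP image embedding -> image pipeline.
# All modules here are toy stand-ins, not the paper's architecture.
import torch
import torch.nn as nn

class Prior(nn.Module):
    """Maps a CLIP text embedding to a predicted CLIP image embedding."""
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, text_emb):
        return self.net(text_emb)

class Decoder(nn.Module):
    """Renders an image conditioned on a CLIP image embedding."""
    def __init__(self, dim=768, image_size=64):
        super().__init__()
        self.image_size = image_size
        self.net = nn.Linear(dim, 3 * image_size * image_size)

    def forward(self, image_emb):
        x = self.net(image_emb)
        return x.view(-1, 3, self.image_size, self.image_size)

# Stage 1: the prior predicts an image embedding from the caption's text embedding.
# Stage 2: the decoder generates pixels from that predicted image embedding.
text_emb = torch.randn(1, 768)        # stand-in for a CLIP text embedding
image_emb = Prior()(text_emb)
image = Decoder()(image_emb)
print(image.shape)                    # torch.Size([1, 3, 64, 64])
```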
Similar Articles
CLIP: Connecting text and images
CLIP is OpenAI's vision-language model that learns from text-image pairs scraped from the internet, enabling zero-shot visual classification without task-specific training data. It addresses major limitations of traditional computer vision by reducing dependence on expensive labeled datasets and improving real-world generalization.
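A brief sketch of the zero-shot classification workflow mentioned above, using the Hugging Face CLIP implementation. The checkpoint name and candidate labels are illustrative choices, not taken from the article.

```python
# Zero-shot image classification with CLIP: score an image against text
# prompts and take a softmax over the image-text similarities.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), color="white")   # stand-in for a real photo
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))           # zero-shot class probabilities
```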
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
Researchers extend MeanFlow one-step image generation from class labels to flexible text inputs by integrating highly discriminative LLM-based text encoders, enabling efficient text-conditioned synthesis with improved performance.
krthr/clip-embeddings
A CLIP-based embedding model hosted on Replicate that generates 768-dimensional embeddings for both images and text using the clip-vit-large-patch14 architecture, costing ~$0.00022 per run.
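A hedged usage sketch for the hosted model above via the Replicate Python client. The input key and output shape are assumptions; the actual schema (and whether a version hash must be pinned) should be checked on the model's Replicate page.

```python
# Requires the REPLICATE_API_TOKEN environment variable to be set.
import replicate

output = replicate.run(
    "krthr/clip-embeddings",                      # model reference; a version hash may be required
    input={"text": "a photo of a red bicycle"},   # assumed input key; image input is also supported per the listing
)
print(output)  # expected to contain a 768-dimensional embedding
```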
Alien Dreams: An Emerging Art Scene
The article highlights the emerging scene of AI-generated art using OpenAI's CLIP model as a steering mechanism for generative models, showcasing various examples of text-to-image outputs.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.