Hierarchical text-conditional image generation with CLIP latents
Summary
OpenAI proposes a two-stage model for text-conditional image generation built on CLIP latents: a prior that generates a CLIP image embedding from a text caption, and a diffusion-based decoder that generates an image conditioned on that embedding. Explicitly generating the image representation improves sample diversity with minimal loss in photorealism and caption similarity, and CLIP's joint embedding space enables zero-shot, language-guided image manipulations.
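The two-stage flow in the summary (caption, then CLIP image embedding, then image) can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not the paper's models: `toy_text_embedding`, `toy_prior`, `toy_decoder`, and `EMBED_DIM` are invented names, and the "models" are simple random functions that only show how data passes between the stages.

```python
import hashlib
import random

EMBED_DIM = 8  # assumption: toy size; real CLIP embeddings are far larger


def toy_text_embedding(caption):
    """Stand-in for a CLIP text encoder: a deterministic pseudo-embedding."""
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]


def toy_prior(text_emb, rng):
    """Stage 1 stand-in: sample an image embedding conditioned on the text embedding."""
    return [t + 0.1 * rng.gauss(0.0, 1.0) for t in text_emb]


def toy_decoder(image_emb, rng, size=4):
    """Stage 2 stand-in: produce a size x size grayscale 'image' from the embedding."""
    bias = sum(image_emb) / len(image_emb)
    return [[bias + 0.1 * rng.gauss(0.0, 1.0) for _ in range(size)]
            for _ in range(size)]


def generate(caption, seed=0):
    """Run both stages: caption -> image embedding -> image."""
    rng = random.Random(seed)
    text_emb = toy_text_embedding(caption)
    image_emb = toy_prior(text_emb, rng)
    return toy_decoder(image_emb, rng)


if __name__ == "__main__":
    image = generate("a corgi playing a trumpet", seed=0)
    print(len(image), "x", len(image[0]))
```

Because both stages sample, changing `seed` for a fixed caption yields a different output, which loosely mirrors how a sampled prior and decoder can produce varied images for the same text.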
Similar Articles
CLIP: Connecting text and images
CLIP is OpenAI's vision-language model that learns from text-image pairs from the internet, enabling zero-shot visual classification without task-specific training data. It addresses major limitations in traditional computer vision by reducing dependence on expensive labeled datasets and improving real-world generalization.
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
Researchers extend MeanFlow one-step image generation from class labels to flexible text inputs by integrating highly-discriminative LLM-based text encoders, enabling efficient text-conditioned synthesis with improved performance.
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
Introduces CLVR (Closed-Loop Visual Reasoning), a framework that reformulates text-to-image generation from a single-step process into a closed-loop, multi-step visual reasoning approach using a VLM controller and diffusion models, achieving improved performance on compositional prompts.
@xichen_pan: Modern text-to-image models are increasingly powered by large pretrained LLMs. But there is a curious mismatch: the LLM…
RepFusion introduces a method to use pretrained multimodal LLMs as noisy representation encoders in diffusion transformers for text-to-image generation, outperforming baselines with similar compute.
Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models
This paper proposes HT-GRPO, a hierarchical reinforcement learning method for diffusion multi-modal large language models that uses a sketch-then-paint training scheme and hierarchical credit assignment to improve image generation quality and reward alignment.