Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
Summary
Introduces CLVR (Closed-Loop Visual Reasoning), a framework that reformulates text-to-image generation from a single-step process into a closed-loop, multi-step visual reasoning approach using a VLM controller and diffusion models, achieving improved performance on compositional prompts.
Cached at: 05/15/26, 08:24 AM
Source: https://huggingface.co/papers/2605.14876
Current text-to-image models still rely heavily on a “single-step generation” paradigm: the model is expected to satisfy all semantic constraints in one denoising process. As prompts become more compositional, this often leads to counting failures, broken spatial relations, attribute confusion, and semantic drift.
In this work, we introduce CLVR (Closed-Loop Visual Reasoning), a framework that reformulates image generation from one-shot prompt-to-image mapping into a closed-loop, multi-step visual reasoning process.
At each step, a VLM controller observes the current canvas and accumulated trajectory, reasons about the remaining semantic gaps, and decides whether to invoke image generation/editing, perform validation, or terminate. The diffusion model then executes the selected visual action, and the updated canvas is fed back into the next reasoning step.
Instead of generating an image once, a VLM continuously inspects the current canvas, identifies semantic gaps, plans the next action, and iteratively edits the image until the user goal is satisfied.
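The control flow described above can be sketched as a small loop. This is only an illustrative sketch, not the authors' implementation: the names `Action`, `Step`, `State`, `run_closed_loop`, `toy_controller`, and `toy_executor` are hypothetical, the canvas is a plain string standing in for an image, and the `max_steps` cap is an added assumption to guarantee termination.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Action(Enum):
    GENERATE = "generate"
    EDIT = "edit"
    VALIDATE = "validate"
    TERMINATE = "terminate"


@dataclass
class Step:
    action: Action
    instruction: str


@dataclass
class State:
    prompt: str
    canvas: Optional[str] = None  # string stand-in for the current image
    trajectory: List[Step] = field(default_factory=list)


# Hypothetical interfaces: a real controller would be a VLM and a real
# executor would be a diffusion model.
Controller = Callable[[State], Step]
Executor = Callable[[Optional[str], Step], str]


def run_closed_loop(prompt: str, controller: Controller,
                    executor: Executor, max_steps: int = 8) -> State:
    state = State(prompt=prompt)
    for _ in range(max_steps):
        # The controller sees the current canvas and the accumulated trajectory.
        step = controller(state)
        state.trajectory.append(step)
        if step.action is Action.TERMINATE:
            break
        if step.action in (Action.GENERATE, Action.EDIT):
            # The executor performs the visual action; the updated canvas
            # is fed back into the next reasoning step.
            state.canvas = executor(state.canvas, step)
        # VALIDATE leaves the canvas unchanged; it is only recorded.
    return state


def toy_controller(state: State) -> Step:
    # Rule-based stand-in for the VLM's gap analysis.
    if state.canvas is None:
        return Step(Action.GENERATE, state.prompt)
    if "red" not in state.canvas:
        return Step(Action.EDIT, "make the cube red")
    if state.trajectory[-1].action is not Action.VALIDATE:
        return Step(Action.VALIDATE, "check colour")
    return Step(Action.TERMINATE, "done")


def toy_executor(canvas: Optional[str], step: Step) -> str:
    # Stand-in for generation/editing: the "image" is a description string.
    if step.action is Action.GENERATE:
        return "cube"
    return f"{canvas} + {step.instruction}"


final = run_closed_loop("a red cube on a table", toy_controller, toy_executor)
print([s.action.value for s in final.trajectory])
```

With these toy stubs the loop generates once, applies one edit, validates, and then terminates; swapping in real models would only change the two callables.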
To make this work, we build:
- a **verified trajectory synthesis pipeline** with both step-level and global verification,
- **Proxy Prompt Reinforcement Learning (PPRL)** for stable long-context multimodal RL,
- and **Δ-Space Weight Merge (DSWM)** to reduce reasoning inference from 28×2 NFEs to only 4 NFEs without expensive re-distillation.
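To put the NFE figures in the last item side by side, the arithmetic works out as follows. The source states only "28×2"; reading it as 28 denoising steps with two network evaluations per step is an assumption here, not something the text confirms.

```latex
\[
  \underbrace{28 \times 2}_{\text{before DSWM}} = 56 \ \text{NFEs},
  \qquad
  \frac{56}{4} = 14
\]
```

Under that reading, each visual action needs about 14 times fewer network evaluations after the merge.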
One key finding from our semantic complexity probe is that single-step T2I models exhibit a capability ceiling as prompt complexity increases, while CLVR maintains substantially stronger performance across high-complexity tiers.
Across GenEval, PRISM, WiseBench, and other benchmarks, CLVR consistently improves over strong open-source baselines and narrows the gap with proprietary systems like GPT-4o and Gemini.
Similar Articles
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR is a research paper proposing a closed-loop framework that collaboratively integrates vision-language models with video generation models to improve visual reasoning and correct failures in real-time.
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
Video Models Can Reason with Verifiable Rewards
VideoRLVR optimizes video diffusion models for verifiable reasoning tasks using reinforcement learning with rule-based rewards, achieving better performance than supervised methods in constraint-satisfying video generation.
CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition
CogOmniControl is a reasoning-driven framework for controllable video generation that uses a specialized vision-language model (CogVLM) trained on anime production data to infer creative intent from sparse conditions, then guides a diffusion-based generator via reinforcement learning, achieving state-of-the-art results on new benchmarks.
VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
This paper introduces a paradigm where Vision-Language Models (VLMs) act as test-time teachers to guide Video Generation Models (VGMs) via differentiable rewards and LoRA optimization, achieving a 16.7-point average improvement on video reasoning benchmarks.