Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Hugging Face Daily Papers 05/14/26, 12:00 AM Papers

Summary

Introduces CLVR (Closed-Loop Visual Reasoning), a framework that reformulates text-to-image generation from a single-step process into a closed-loop, multi-step visual reasoning approach using a VLM controller and diffusion models, achieving improved performance on compositional prompts.

Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose Δ-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.

Cached at: 05/15/26, 08:24 AM

Source: https://huggingface.co/papers/2605.14876

Current text-to-image models still rely heavily on a “single-step generation” paradigm: the model is expected to satisfy all semantic constraints in one denoising process. As prompts become more compositional, this often leads to counting failures, broken spatial relations, attribute confusion, and semantic drift.

In this work, we introduce CLVR (Closed-Loop Visual Reasoning), a framework that reformulates image generation from one-shot prompt-to-image mapping into a closed-loop, multi-step visual reasoning process.

At each step, a VLM controller observes the current canvas and accumulated trajectory, reasons about the remaining semantic gaps, and decides whether to invoke image generation/editing, perform validation, or terminate. The diffusion model then executes the selected visual action, and the updated canvas is fed back into the next reasoning step.

Instead of generating an image once, a VLM continuously inspects the current canvas, identifies semantic gaps, plans the next action, and iteratively edits the image until the user goal is satisfied.
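The control flow described above can be sketched as a small loop. This is only an illustration under stated assumptions, not the authors' implementation: the action names, the `Trajectory` container, `run_closed_loop`, the step budget, and the toy controller and executor (which stand in for the VLM and the diffusion model, with a string standing in for the canvas) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical action names. The source only says the controller may
# invoke generation/editing, perform validation, or terminate.
GENERATE, VALIDATE, TERMINATE = "generate_or_edit", "validate", "terminate"


@dataclass
class Trajectory:
    prompt: str
    canvas: str = ""  # stand-in for the current image
    steps: List[Tuple[str, str]] = field(default_factory=list)


def run_closed_loop(
    prompt: str,
    controller: Callable[[Trajectory], Tuple[str, str]],
    executor: Callable[[str, str], str],
    max_steps: int = 8,  # assumed safety budget, not from the source
) -> Trajectory:
    traj = Trajectory(prompt=prompt)
    for _ in range(max_steps):
        # The controller sees the canvas and the accumulated trajectory.
        action, instruction = controller(traj)
        traj.steps.append((action, instruction))
        if action == TERMINATE:
            break
        if action == GENERATE:
            # The executor applies the visual action; the updated canvas
            # is what the controller observes on the next iteration.
            traj.canvas = executor(traj.canvas, instruction)
        # VALIDATE leaves the canvas unchanged; it is only recorded.
    return traj


def toy_controller(traj: Trajectory) -> Tuple[str, str]:
    # Toy "semantic gap" check: which requested objects are not on the canvas?
    required = [part.strip() for part in traj.prompt.split(",")]
    missing = [obj for obj in required if obj not in traj.canvas]
    if missing:
        return GENERATE, "add " + missing[0]
    if not traj.steps or traj.steps[-1][0] != VALIDATE:
        return VALIDATE, "check all constraints"
    return TERMINATE, "goal satisfied"


def toy_executor(canvas: str, instruction: str) -> str:
    # Toy "edit": append the requested object to the canvas description.
    obj = instruction[len("add "):]
    return obj if not canvas else canvas + " | " + obj


if __name__ == "__main__":
    result = run_closed_loop("red cube, blue sphere", toy_controller, toy_executor)
    for action, instruction in result.steps:
        print(action, "->", instruction)
    print("final canvas:", result.canvas)
```

In this toy run the controller issues one edit per missing object, validates once, and then terminates. A real controller would reason over images rather than strings, and how it reaches the termination decision is not specified at this level of detail in the text above.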

To make this work, we build:

- a **verified trajectory synthesis pipeline** with both step-level and global verification,
- **Proxy Prompt Reinforcement Learning (PPRL)** for stable long-context multimodal RL,
- and **Δ-Space Weight Merge (DSWM)** to reduce the per-step inference cost from 28×2 NFEs to only 4 NFEs without expensive re-distillation.
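To make the last item more concrete, here is a minimal sketch of what merging in weight-delta space could look like, together with the NFE arithmetic. It rests on several assumptions: the text does not give the DSWM formula, so the code uses generic task-arithmetic-style merging (add each fine-tune's offset from a shared base checkpoint) as a stand-in; the function names and the `alpha` and `beta` scales are hypothetical; and reading "28×2" as 28 sampler steps with two network evaluations each is a guess.

```python
from typing import Dict, List

# Toy weights: parameter name -> flat list of floats (stand-in for tensors).
Weights = Dict[str, List[float]]


def delta(finetuned: Weights, base: Weights) -> Weights:
    """Offset of a fine-tuned checkpoint from the shared base checkpoint."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def merge_in_delta_space(
    base: Weights,
    aligned: Weights,
    distilled: Weights,
    alpha: float = 1.0,  # assumed scale on the alignment delta
    beta: float = 1.0,   # assumed scale on the distillation delta
) -> Weights:
    """Generic delta merge: base + alpha * d_align + beta * d_distill.

    Illustrative only; not necessarily the formula used by DSWM.
    """
    d_align = delta(aligned, base)
    d_distill = delta(distilled, base)
    return {
        k: [
            b + alpha * da + beta * dd
            for b, da, dd in zip(base[k], d_align[k], d_distill[k])
        ]
        for k in base
    }


def nfe_speedup(steps_before: int, evals_per_step: int, nfes_after: int) -> float:
    """Ratio of network function evaluations before and after."""
    return steps_before * evals_per_step / nfes_after


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    aligned = {"w": [1.5, 2.0]}    # toy alignment fine-tune
    distilled = {"w": [1.0, 1.0]}  # toy few-step distilled checkpoint
    print(merge_in_delta_space(base, aligned, distilled))
    print(nfe_speedup(28, 2, 4))   # 56 NFEs versus 4 NFEs
```

Under that reading, 28×2 = 56 evaluations per reasoning step drop to 4, a 14× reduction in network calls. The appeal of a merge of this kind is that it only needs existing checkpoints, which fits the stated goal of avoiding re-distillation.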

One key finding from our semantic complexity probe is that single-step T2I models exhibit a capability ceiling as prompt complexity increases, while CLVR maintains substantially stronger performance across high-complexity tiers.

Across GenEval, PRISM, WiseBench, and other benchmarks, CLVR consistently improves over strong open-source baselines and narrows the gap with proprietary systems like GPT-4o and Gemini.
