SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

Hugging Face Daily Papers

Summary

SCOPE is a specification-guided framework for text-to-image generation that tracks semantic commitments to better fulfill complex visual intents. It introduces the Gen-Arena benchmark and demonstrates strong performance on complex generation tasks.

While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.
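
The idea of keeping each commitment identifiable across grounding, generation, and verification can be illustrated with a small data structure. The following is a minimal sketch only, not the authors' implementation: the names `Status`, `Commitment`, `Specification`, and `select_skill`, the four status values, and the status-to-skill mapping are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    # Hypothetical lifecycle states for one semantic commitment.
    UNRESOLVED = "unresolved"  # not yet grounded (e.g. an unfamiliar entity)
    RESOLVED = "resolved"      # grounded and ready for generation
    VIOLATED = "violated"      # verification found it missing or wrong
    SATISFIED = "satisfied"    # verification confirmed it in the image


@dataclass
class Commitment:
    cid: str                   # stable ID, so the same unit is tracked end to end
    text: str
    status: Status = Status.UNRESOLVED


@dataclass
class Specification:
    commitments: Dict[str, Commitment] = field(default_factory=dict)

    def add(self, cid: str, text: str) -> None:
        self.commitments[cid] = Commitment(cid, text)

    def pending(self) -> List[Commitment]:
        # Only unresolved or violated commitments need further work.
        return [c for c in self.commitments.values()
                if c.status in (Status.UNRESOLVED, Status.VIOLATED)]


def select_skill(c: Commitment) -> Optional[str]:
    # Conditional invocation: a skill is chosen only when the commitment
    # is unresolved or violated. The mapping itself is an assumption.
    if c.status is Status.UNRESOLVED:
        return "retrieve_or_reason"
    if c.status is Status.VIOLATED:
        return "repair"
    return None


spec = Specification()
spec.add("e1", "a red fox")
spec.add("c1", "the fox sits to the left of the lantern")
for c in spec.pending():
    print(c.cid, select_skill(c))
```

Because every commitment carries a stable `cid`, a verification step can mark the same record as violated that the grounding step resolved earlier, which is the kind of continuity the abstract argues is lost in the Conceptual Rift.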

Cached at: 05/11/26, 07:19 AM

Source: https://huggingface.co/papers/2605.08043

Abstract

SCOPE is a specification-guided framework that maintains semantic commitments throughout text-to-image generation to improve complex visual intent fulfillment.

While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.
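
The abstract describes EGIP only as a strict entity-first pass criterion and does not give the formula. One plausible reading, sketched below as an assumption, is that a sample passes only if every annotated entity is judged present and, once that gate is cleared, every constraint also holds; the score is then the fraction of passing samples. The function names and the toy verdicts are hypothetical.

```python
from typing import List, Sequence, Tuple

# One sample = (per-entity verdicts, per-constraint verdicts).
Sample = Tuple[List[bool], List[bool]]


def sample_passes(entity_checks: List[bool], constraint_checks: List[bool]) -> bool:
    # Entity gate (assumed): any missing entity fails the sample outright,
    # so constraints are only consulted when all entities are present.
    if not all(entity_checks):
        return False
    return all(constraint_checks)


def egip(samples: Sequence[Sample]) -> float:
    # Fraction of samples that clear the entity gate and all constraints.
    if not samples:
        raise ValueError("egip requires at least one sample")
    passed = sum(sample_passes(e, c) for e, c in samples)
    return passed / len(samples)


# Toy verdicts, invented for illustration (not data from the paper).
toy = [
    ([True, True], [True]),         # passes
    ([True, False], [True, True]),  # fails at the entity gate
    ([True], [True, False]),        # entities fine, one constraint violated
    ([True], [True]),               # passes
    ([False], []),                  # fails at the entity gate
]
print(egip(toy))  # 0.4
```

Under this reading the metric gives no partial credit: one failed entity or constraint fails the whole sample.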
