SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

Hugging Face Daily Papers 05/08/26, 12:00 AM Papers

text-to-image ai-generation skill-orchestration benchmark computer-vision semantic-commitments

Summary

SCOPE is a specification-guided framework for text-to-image generation that tracks semantic commitments to better fulfill complex visual intents. It introduces the Gen-Arena benchmark and demonstrates strong performance on complex generation tasks.

While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.
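The idea of keeping commitments identifiable across grounding, generation, and verification can be illustrated with a small data-structure sketch. This is not the authors' implementation: the names `Commitment`, `Specification`, `Status`, and `plan_skills` are hypothetical, and the routing rule (entities go to retrieval, constraints go to reasoning, violations go to repair) is an assumption made only for illustration. The abstract states only that retrieval, reasoning, and repair skills are invoked conditionally around unresolved or violated commitments.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    UNRESOLVED = "unresolved"  # not yet grounded
    RESOLVED = "resolved"      # grounded, ready for generation
    VIOLATED = "violated"      # verification found the image breaks it
    SATISFIED = "satisfied"    # verification passed


@dataclass
class Commitment:
    cid: str   # stable identifier, kept unchanged through the whole lifecycle
    kind: str  # "entity" or "constraint"
    text: str
    status: Status = Status.UNRESOLVED


@dataclass
class Specification:
    """Evolving structured specification: one record per commitment."""
    commitments: dict = field(default_factory=dict)

    def add(self, commitment: Commitment) -> None:
        self.commitments[commitment.cid] = commitment


def select_skill(commitment: Commitment):
    """Hypothetical routing rule (an assumption, not taken from the paper)."""
    if commitment.status is Status.VIOLATED:
        return "repair"
    if commitment.status is Status.UNRESOLVED:
        return "retrieval" if commitment.kind == "entity" else "reasoning"
    return None  # resolved or satisfied: no skill is invoked


def plan_skills(spec: Specification):
    """Return (commitment id, skill) pairs only for commitments that need work."""
    plan = []
    for c in spec.commitments.values():
        skill = select_skill(c)
        if skill is not None:
            plan.append((c.cid, skill))
    return plan


if __name__ == "__main__":
    spec = Specification()
    spec.add(Commitment("e1", "entity", "a red fox"))
    spec.add(Commitment("c1", "constraint", "the fox is left of the tree",
                        status=Status.VIOLATED))
    spec.add(Commitment("c2", "constraint", "snowy background",
                        status=Status.SATISFIED))
    print(plan_skills(spec))
```

Because every skill call is keyed by the same `cid`, a commitment that is grounded, rendered, and later found violated is still the same record, which is the property the abstract describes as missing under the Conceptual Rift.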

Original Article

Cached at: 05/11/26, 07:19 AM

Source: https://huggingface.co/papers/2605.08043 Authors:

Abstract

SCOPE is a specification-guided framework that maintains semantic commitments throughout text-to-image generation to improve complex visual intent fulfillment.

While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.
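The abstract describes EGIP only as "a strict entity-first pass criterion" and does not give a formula. The sketch below shows one plausible reading, stated here as an assumption: a sample passes only if every annotated entity is judged present, and only then are its constraints checked, all of which must also hold. The score is the fraction of passing samples. The function names and the boolean-list input format are hypothetical, and the toy judgments are made up for illustration, not benchmark data.

```python
def sample_passes(entity_ok, constraint_ok):
    """Entity gate first: constraints are only consulted if all entities pass."""
    if not all(entity_ok):
        return False
    return all(constraint_ok)


def egip(samples):
    """Fraction of samples that pass the assumed entity-gated criterion.

    `samples` is a list of (entity_ok, constraint_ok) pairs, each a list of
    booleans produced by some per-commitment judge (not modelled here).
    """
    if not samples:
        raise ValueError("need at least one sample")
    passed = sum(1 for entities, constraints in samples
                 if sample_passes(entities, constraints))
    return passed / len(samples)


if __name__ == "__main__":
    toy = [
        ([True, True], [True]),         # passes
        ([True, False], [True]),        # fails at the entity gate
        ([True], [True, False]),        # entities fine, one constraint fails
        ([True], [True]),               # passes
    ]
    print(egip(toy))  # 0.5
```

Under this reading, a single missing entity fails the whole sample regardless of how many constraints hold, which is what makes the criterion strict.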

Get this paper in your agent:

hf papers read 2605.08043

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing
