SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

Abstract

Text-to-image models have made strong progress in visual fidelity, yet faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as **semantic commitments** and formalize the discontinuity in their lifecycle as the **Conceptual Rift**: a commitment may be resolved or checked locally, yet fail to remain identifiable as the same operational unit throughout the generation lifecycle. To address this, we propose **SCOPE**, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills for commitments that are unresolved or violated. To evaluate intent realization at the commitment level, we introduce **Gen-Arena**, a human-annotated benchmark with entity-level and constraint-level specifications, together with the **Entity-Gated Intent Pass Rate (EGIP)**, a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving an EGIP of 0.60, and also achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.
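The abstract does not give the specification schema or the exact EGIP formula, so the following is only a minimal sketch of one plausible reading: commitments are kept as records with a status, skills are selected only for commitments that are unresolved or violated, and a sample passes only if its entity-level commitments pass before its constraint-level commitments are considered. All names here (`Commitment`, `pending_skills`, `entity_gated_pass`, `egip`, the status strings, and the skill labels) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical status labels; the paper's actual schema is not specified in the abstract.
UNRESOLVED, SATISFIED, VIOLATED = "unresolved", "satisfied", "violated"


@dataclass
class Commitment:
    cid: str                  # stable identifier, kept across grounding, generation, verification
    kind: str                 # "entity" or "constraint"
    text: str                 # the requirement in natural language
    status: str = UNRESOLVED


def pending_skills(spec: List[Commitment]) -> List[Tuple[str, str]]:
    """Pair each commitment that still needs work with an illustrative skill label."""
    plan = []
    for c in spec:
        if c.status == UNRESOLVED:
            plan.append((c.cid, "retrieve_or_reason"))
        elif c.status == VIOLATED:
            plan.append((c.cid, "repair"))
    return plan


def entity_gated_pass(spec: List[Commitment]) -> bool:
    """Assumed gate: entity commitments are checked first; constraints count only if they pass."""
    entities = [c for c in spec if c.kind == "entity"]
    constraints = [c for c in spec if c.kind == "constraint"]
    if not all(c.status == SATISFIED for c in entities):
        return False          # gate closed: constraints are not even consulted
    return all(c.status == SATISFIED for c in constraints)


def egip(samples: List[List[Commitment]]) -> float:
    """Fraction of samples that pass the assumed entity-gated criterion."""
    if not samples:
        return 0.0
    return sum(entity_gated_pass(s) for s in samples) / len(samples)
```

Under this assumed reading, a sample with one violated entity fails regardless of how many constraints it satisfies, which is what makes the criterion strict; the published definition may differ in its details.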
