SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

Abstract

While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.
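The abstract describes EGIP only as a strict entity-first pass criterion and does not give its formula. As a rough illustration of what an entity-gated pass rate could look like, the sketch below assumes that an image passes only if every annotated entity is judged correct and, once that gate is cleared, every annotated constraint also holds. The `SampleJudgement` record, its field names, and the all-or-nothing aggregation are assumptions made for illustration, not the paper's definition.

```python
from dataclasses import dataclass, field


@dataclass
class SampleJudgement:
    """Hypothetical per-image record: one boolean per annotated item."""
    entity_ok: list[bool]
    constraint_ok: list[bool] = field(default_factory=list)


def sample_passes(judgement: SampleJudgement) -> bool:
    # Entity gate (assumed): any wrong or missing entity fails the sample
    # outright, so constraints are only consulted after the gate is cleared.
    if not all(judgement.entity_ok):
        return False
    # Assumed strictness: every constraint must hold for the sample to pass.
    return all(judgement.constraint_ok)


def egip(judgements: list[SampleJudgement]) -> float:
    """Fraction of samples that clear the entity gate and all constraints."""
    if not judgements:
        raise ValueError("at least one judgement is required")
    passed = sum(sample_passes(j) for j in judgements)
    return passed / len(judgements)
```

Under these assumptions, a sample with perfect constraint checks but one incorrect entity still counts as a failure, which is one plausible reading of "entity-first".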
