SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

Abstract

While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.
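As a rough illustration of the commitment lifecycle described in the abstract, the sketch below keeps each commitment under a stable ID in a small specification and invokes a skill only when a commitment is unresolved or violated. It is not the authors' implementation. The names `Commitment`, `Status`, `pick_skill`, `orchestrate`, and `egip` are hypothetical, the mapping from status to skill is an assumption, and the `egip` function encodes only one possible reading of "entity-first" (a sample passes only if every entity commitment is satisfied and then every constraint commitment is satisfied).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Status(Enum):
    UNRESOLVED = "unresolved"  # not yet grounded
    RESOLVED = "resolved"      # grounded or repaired, awaiting verification
    VIOLATED = "violated"      # verification found it missing or wrong
    SATISFIED = "satisfied"    # verification confirmed it


@dataclass
class Commitment:
    cid: str    # stable ID so the same unit is tracked across all stages
    kind: str   # "entity" or "constraint" (assumed two-way split)
    text: str
    status: Status = Status.UNRESOLVED
    notes: List[str] = field(default_factory=list)


def pick_skill(c: Commitment) -> Optional[str]:
    """Assumed policy: choose a skill from the commitment's current state."""
    if c.status is Status.VIOLATED:
        return "repair"
    if c.status is Status.UNRESOLVED:
        return "retrieval" if c.kind == "entity" else "reasoning"
    return None  # resolved or satisfied: no skill is invoked


def orchestrate(spec: Dict[str, Commitment]) -> List[Tuple[str, str]]:
    """One pass over the spec; skills here are stubs that only update state."""
    log = []
    for c in spec.values():
        name = pick_skill(c)
        if name is None:
            continue
        c.notes.append(f"{name} applied")
        c.status = Status.RESOLVED  # must be verified again before it counts
        log.append((c.cid, name))
    return log


def egip(samples: List[List[Commitment]]) -> float:
    """Assumed reading of EGIP: fraction of samples that pass the entity gate
    and then also satisfy every constraint commitment."""
    passed = 0
    for commitments in samples:
        entities = [c for c in commitments if c.kind == "entity"]
        constraints = [c for c in commitments if c.kind == "constraint"]
        entity_gate = all(c.status is Status.SATISFIED for c in entities)
        if entity_gate and all(c.status is Status.SATISFIED for c in constraints):
            passed += 1
    return passed / len(samples) if samples else 0.0


# Hypothetical specification for a single prompt.
spec = {c.cid: c for c in [
    Commitment("e1", "entity", "a specific landmark"),
    Commitment("c1", "constraint", "the landmark is left of the river"),
    Commitment("c2", "constraint", "the scene is at dusk", Status.VIOLATED),
    Commitment("e2", "entity", "a red bicycle", Status.SATISFIED),
]}
for cid, skill in orchestrate(spec):
    print(cid, skill)
```

In this sketch the satisfied commitment `e2` is left untouched, which is the point of conditional invocation: work is spent only on commitments whose recorded state calls for it, and every commitment keeps the same ID from grounding through verification.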

