SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

Abstract

While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, in which commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and also performs strongly on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.

