Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

Abstract

Humans paint images incrementally: they plan a global layout, sketch a coarse draft, inspect it, and refine the details. Most importantly, each step is grounded in the evolving visual state. Can unified multimodal models trained on interleaved text-image datasets also imagine this chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating an image in a single step, our approach unfolds across multiple iterations, each consisting of four stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core challenge of process-driven generation stems from the ambiguity of intermediate states: how can a model evaluate a partially complete image? We address this through dense, step-wise supervision that maintains two complementary constraints. For the visual intermediate states, we enforce spatial and semantic consistency. For the textual intermediate states, we preserve prior visual knowledge while enabling the model to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate the proposed method, we conduct experiments on various text-to-image generation benchmarks.
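The four-stage iteration described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Step` and `generate`, the four toy callables, and the use of a list of strings as a stand-in for the visual state are all assumptions made for illustration. It only shows how textual thoughts and visual states might alternate within one trajectory.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Step:
    kind: str     # "plan", "draft", "reflect", or "refine"
    content: Any  # a textual thought, or a stand-in for a visual state


def generate(
    prompt: str,
    plan: Callable[[str, Optional[list]], str],
    draft: Callable[[str, Optional[list]], list],
    reflect: Callable[[str, list], str],
    refine: Callable[[str, list], list],
    iterations: int = 2,
) -> List[Step]:
    """Run the hypothetical four-stage loop and return the interleaved trajectory."""
    trajectory: List[Step] = []
    image: Optional[list] = None  # no visual state exists before the first draft
    for _ in range(iterations):
        # Textual planning: conditioned on the prompt and the current visual state.
        thought = plan(prompt, image)
        trajectory.append(Step("plan", thought))
        # Visual drafting: the plan conditions how the visual state evolves.
        image = draft(thought, image)
        trajectory.append(Step("draft", image))
        # Textual reflection: grounded in the draft that was just produced.
        critique = reflect(prompt, image)
        trajectory.append(Step("reflect", critique))
        # Visual refinement: apply the critique to the draft.
        image = refine(critique, image)
        trajectory.append(Step("refine", image))
    return trajectory


# Toy stand-ins (assumptions): a real system would call a multimodal model here.
def toy_plan(prompt, image):
    return "place objects for: " + prompt


def toy_draft(thought, image):
    return (image or []) + ["coarse draft"]


def toy_reflect(prompt, image):
    return f"check {len(image)} strokes against the prompt"


def toy_refine(critique, image):
    return image + ["refined detail"]
```

Calling `generate("a red cube on a table", toy_plan, toy_draft, toy_reflect, toy_refine)` yields a trajectory whose steps alternate between text and visual state, which is the property that would make each intermediate step individually inspectable and supervisable.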
