IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Abstract

Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to the entanglement of structural planning and appearance rendering within a single conditioning stream. To address this issue, we propose Implicit Visual Chain-of-Thought (IV-CoT), a latent visual reasoning framework for query-conditioned image generation. IV-CoT decomposes the visual conditioning queries into a structural-to-semantic cascade, where structural queries first form a latent visual plan and semantic queries then render appearance conditioned on this plan. To guide the structural queries, we introduce training-only sketch supervision, which encourages them to capture structure from sketches without requiring sketch extraction or intermediate decoding at inference time. IV-CoT performs implicit CoT reasoning in a single forward pass and achieves superior results on GenEval and T2I-CompBench. Visualizations and analyses demonstrate that the learned structural and semantic queries play complementary roles in structure-aware generation.
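
The structural-to-semantic cascade can be pictured as an attention pattern over the conditioning queries. The abstract does not say how the cascade is implemented, so the following is only a minimal sketch under an assumed realization: one token sequence laid out as [text | structural queries | semantic queries], with a mask that lets structural queries read the prompt and each other, and lets semantic queries additionally read the structural plan. The names `cascade_mask` and `masked_self_attention` are hypothetical, the attention has no learned projections, and the training-only sketch supervision is not modeled here.

```python
import numpy as np


def cascade_mask(n_text, n_struct, n_sem):
    """Boolean mask, True where the row token may attend to the column token.

    Assumed layout (not from the paper): [text | structural | semantic].
    """
    n = n_text + n_struct + n_sem
    text = slice(0, n_text)
    struct = slice(n_text, n_text + n_struct)
    sem = slice(n_text + n_struct, n)

    mask = np.zeros((n, n), dtype=bool)
    mask[text, text] = True      # prompt tokens attend to the prompt
    mask[struct, text] = True    # structural queries read the prompt
    mask[struct, struct] = True  # and each other, forming the latent plan
    mask[sem, :] = True          # semantic queries read prompt, plan, and each other
    return mask


def masked_self_attention(x, mask):
    """Single-head attention without learned projections (illustration only)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = np.where(mask, scores, -np.inf)      # blocked pairs get zero weight
    scores -= scores.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x
```

Because both query groups sit in one sequence, a single call to `masked_self_attention` covers the whole cascade, which is one way to read the "single forward pass" claim. Under this mask the structural outputs cannot depend on the semantic queries, while the semantic outputs do depend on the structural ones.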