IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Abstract

Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to the entanglement of structural planning and appearance rendering within a single conditioning stream. To address this issue, we propose Implicit Visual Chain-of-Thought (IV-CoT), a latent visual reasoning framework for query-conditioned image generation. IV-CoT decomposes the visual conditioning queries into a structural-to-semantic cascade, where structural queries first form a latent visual plan and semantic queries then render appearance conditioned on this plan. To guide the structural queries, we introduce training-only sketch supervision, which encourages them to capture structure from sketches without requiring sketch extraction or intermediate decoding at inference time. IV-CoT performs implicit CoT reasoning in a single forward pass and achieves superior results on GenEval and T2I-CompBench. Visualizations and analyses demonstrate that the learned structural and semantic queries play complementary roles in structure-aware generation.
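
The structural-to-semantic cascade described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`attend`, `cascade`, `sketch_loss`), the use of plain single-head attention without learned projections, and the mean-squared-error form of the sketch loss are all assumptions made for illustration. The only ideas taken from the abstract are the ordering (structural queries form a latent plan first, semantic queries are then conditioned on that plan) and the fact that sketch supervision is applied during training only.

```python
import math
import random


def attend(queries, context):
    """Single-head scaled dot-product attention with a residual connection.

    Hypothetical stand-in for the query-to-context interaction; it has no
    learned projections and is not the mechanism used in the paper.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w * c[i] for w, c in zip(weights, context)) / total
                 for i in range(dim)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def cascade(text_tokens, structural_queries, semantic_queries):
    """Structural-to-semantic cascade (assumed form)."""
    # Stage 1: structural queries read only the prompt and form a latent plan.
    plan = attend(structural_queries, text_tokens)
    # Stage 2: semantic queries read the prompt together with the plan.
    appearance = attend(semantic_queries, text_tokens + plan)
    return plan, appearance


def sketch_loss(plan, sketch_features):
    """Assumed training-only auxiliary loss: MSE between plan and sketch features."""
    count = sum(len(row) for row in plan)
    squared = sum((p - s) ** 2
                  for plan_row, sketch_row in zip(plan, sketch_features)
                  for p, s in zip(plan_row, sketch_row))
    return squared / count


if __name__ == "__main__":
    rng = random.Random(0)

    def rand(n, d):
        return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

    text = rand(5, 8)                     # 5 prompt tokens of width 8
    plan, appearance = cascade(text, rand(3, 8), rand(4, 8))
    print(len(plan), len(appearance))     # number of structural / semantic outputs
```

In this sketch, `sketch_loss` would be computed only during training, against features of a sketch extracted from the target image. At inference, `cascade` runs alone in one pass, which mirrors the abstract's claim that no sketch extraction or intermediate decoding is needed at that point.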