IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Abstract

Unified multimodal large language models (MLLMs) have achieved strong text-to-image generation quality, but they still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to the entanglement of structural planning and appearance rendering within a single conditioning stream. To address this issue, we propose Implicit Visual Chain-of-Thought (IV-CoT), a latent visual reasoning framework for query-conditioned image generation. IV-CoT decomposes the visual conditioning queries into a structural-to-semantic cascade: structural queries first form a latent visual plan, and semantic queries then render appearance conditioned on this plan. To guide the structural queries, we introduce training-only sketch supervision, which encourages them to capture structural information from sketches without requiring sketch extraction or intermediate decoding at inference time. IV-CoT thus performs implicit chain-of-thought (CoT) reasoning in a single forward pass and achieves superior results on GenEval and T2I-CompBench. Visualizations and analyses indicate that the learned structural and semantic queries play complementary roles in structure-aware generation.
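
To make the structural-to-semantic cascade concrete, the following is a minimal NumPy sketch of one way such a two-stage query design could be wired. It is an illustration under stated assumptions, not the implementation described in this paper: the single-head attention, the residual update, the concatenation of plan and appearance queries into one conditioning sequence, the mean-squared-error sketch loss, and all names (`cross_attend`, `iv_cot_conditioning`, `sketch_loss`) and sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Single-head scaled dot-product attention with a residual update.

    Assumption: no learned projections, to keep the sketch short.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return queries + weights @ context


def iv_cot_conditioning(text_tokens, structural_q, semantic_q):
    """Two-stage cascade, computed in one forward pass.

    Stage 1: structural queries read the prompt and form a latent plan.
    Stage 2: semantic queries read the prompt AND the plan, so appearance
    is conditioned on structure. Nothing is decoded in between.
    """
    plan = cross_attend(structural_q, text_tokens)
    context = np.concatenate([text_tokens, plan], axis=0)
    appearance = cross_attend(semantic_q, context)
    # Assumption: both query sets are passed on to the image decoder.
    conditioning = np.concatenate([plan, appearance], axis=0)
    return plan, conditioning


def sketch_loss(plan, sketch_features):
    """Training-only auxiliary loss (hypothetical MSE form).

    sketch_features stands in for features of a sketch of the target
    image; it is never needed at inference time.
    """
    return float(np.mean((plan - sketch_features) ** 2))


rng = np.random.default_rng(0)
d = 16                                   # hypothetical feature width
text_tokens = rng.normal(size=(7, d))    # encoded prompt tokens
structural_q = rng.normal(size=(4, d))   # structural queries
semantic_q = rng.normal(size=(8, d))     # semantic queries

plan, cond = iv_cot_conditioning(text_tokens, structural_q, semantic_q)
print(plan.shape, cond.shape)  # (4, 16) (12, 16)
```

In this sketch the sketch loss would be added to the generation objective during training only; at inference, `iv_cot_conditioning` is called on its own, which mirrors the claim that no sketch extraction or intermediate decoding is required.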