Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders
January 15, 2026
Authors: Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang, Quan Chen, Peng Jiang, Xiao Yang, Jun Zhu, Kai Yu, Zhijie Deng
cs.AI
Abstract
Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers -- they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, in which the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the hidden states of the rewritten prompts then serve as the diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized via Dual-GRPO to ensure faithful reasoning about the context and accurate rendering of the semantics. In particular, the text encoder is reinforced with image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving a WISE score of 0.79, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.
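The Dual-GRPO co-optimization described above builds on GRPO-style reinforcement, where each sampled rollout (here, a rewritten prompt scored by an image-grounded reward) is credited relative to its group. The paper does not give implementation details, so the following is only a minimal sketch of the standard group-relative advantage computation, with hypothetical reward values:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each rollout's reward by the
    mean and standard deviation of its sampling group, so rollouts
    better than the group average get positive advantage."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Hypothetical image-grounded rewards for one group of 4 rewritten prompts
rewards = [0.9, 0.7, 0.4, 0.2]
advantages = group_relative_advantages(rewards)
```

These advantages would then weight the policy-gradient update of the LLM encoder; a symmetric reward on the rendered image would drive the diffusion backbone's update.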