Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders
January 15, 2026
Authors: Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang, Quan Chen, Peng Jiang, Xiao Yang, Jun Zhu, Kai Yu, Zhijie Deng
cs.AI
Abstract
Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers -- they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as the diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and the diffusion backbone are co-optimized via Dual-GRPO to ensure faithful reasoning about the context and accurate rendering of the semantics. In particular, the text encoder is reinforced with image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, reaching a WISE score of 0.79, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.
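To make the described pipeline concrete, below is a minimal sketch of the T2G conditioning flow. It assumes a Hugging Face-style causal LLM (Qwen2.5 is chosen arbitrarily for illustration) and hypothetical helper names (`think_then_rewrite`, `encode_rewritten`); the paper's actual encoder, prompt template, and conditioning interface may differ.

```python
# Hypothetical sketch of the think-then-generate (T2G) flow, NOT the
# authors' actual implementation. Model choice and all helper names
# are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # assumed; the paper's LLM encoder is unspecified here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def think_then_rewrite(user_prompt: str) -> str:
    """Step 1: the LLM reasons about the raw prompt and rewrites it into
    an explicit visual description (the 'think-then-rewrite' pattern the
    paper activates via lightweight SFT)."""
    messages = [
        {"role": "system",
         "content": "Reason about what the prompt should depict, "
                    "then rewrite it as an explicit visual description."},
        {"role": "user", "content": user_prompt},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    out = llm.generate(input_ids, max_new_tokens=256)
    # Decode only the newly generated tokens (the rewritten prompt).
    return tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)

def encode_rewritten(rewritten: str) -> torch.Tensor:
    """Step 2: re-encode the rewritten prompt and take the LLM's hidden
    states as the diffusion condition, instead of the raw-prompt states."""
    ids = tokenizer(rewritten, return_tensors="pt").input_ids
    hidden = llm(ids, output_hidden_states=True).hidden_states[-1]
    return hidden  # shape (1, seq_len, d_model); fed to cross-attention

# Step 3 (not shown): the diffusion backbone cross-attends to these
# hidden states to render the image. Per the abstract, both modules are
# then co-optimized with Dual-GRPO, the encoder rewarded via
# image-grounded signals and the backbone via semantic/visual consistency.
```

For example, a prompt like "the flower that blooms at night" would first be rewritten into an explicit description of a moonflower before its encoder states condition the diffusion backbone, which is the gap between literal text-pixel mapping and reasoning-aware generation that the paper targets.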