GoT: Unleashing the Reasoning Capability of Multimodal Large Language Models for Visual Generation and Editing

Abstract

Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/rongyaofang/GoT.
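The abstract describes reasoning chains that capture semantic-spatial relationships and that a user can modify before the image is produced. As a rough illustration only, the sketch below assumes a hypothetical text format in which each described object is written as `[phrase] (x1,y1),(x2,y2)`. The bracket syntax, the pixel-coordinate convention, and the names `GroundedPhrase`, `parse_chain`, and `shift_phrase` are assumptions made for this example; they are not taken from the GoT datasets or code.

```python
import re
from dataclasses import dataclass

# Hypothetical chain format: "[phrase] (x1,y1),(x2,y2)" embedded in free text.
_PATTERN = re.compile(r"\[([^\]]+)\]\s*\((\d+),(\d+)\),\((\d+),(\d+)\)")


@dataclass(frozen=True)
class GroundedPhrase:
    """A phrase from the reasoning chain plus its bounding box (assumed format)."""
    text: str
    box: tuple  # (x1, y1, x2, y2) in integer coordinates


def parse_chain(chain: str) -> list:
    """Extract every grounded phrase and its box from a reasoning chain."""
    return [
        GroundedPhrase(m.group(1), tuple(int(g) for g in m.groups()[1:]))
        for m in _PATTERN.finditer(chain)
    ]


def shift_phrase(chain: str, target: str, dx: int, dy: int) -> str:
    """Return a copy of the chain with the target phrase's box moved by (dx, dy).

    This mimics an interactive edit: the user changes a reasoning step,
    and the edited chain would then be passed to the image generator.
    """
    def _sub(m):
        if m.group(1) != target:
            return m.group(0)  # leave other objects untouched
        x1, y1, x2, y2 = (int(g) for g in m.groups()[1:])
        return f"[{target}] ({x1 + dx},{y1 + dy}),({x2 + dx},{y2 + dy})"

    return _PATTERN.sub(_sub, chain)


if __name__ == "__main__":
    chain = (
        "A park scene with [a red ball] (100,600),(200,700) lying left of "
        "[a wooden bench] (400,450),(900,800)."
    )
    for phrase in parse_chain(chain):
        print(phrase.text, phrase.box)
    print(shift_phrase(chain, "a red ball", 50, 0))
```

Keeping the chain as editable text, rather than as an opaque embedding, is what makes this kind of targeted adjustment possible: only the box of the chosen phrase changes, and the rest of the reasoning is preserved.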