GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
March 13, 2025
Authors: Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, Hongsheng Li
cs.AI
Abstract
Current image generation and editing methods primarily process textual
prompts as direct inputs without reasoning about visual composition and
explicit operations. We present Generation Chain-of-Thought (GoT), a novel
paradigm that enables generation and editing through an explicit language
reasoning process before outputting images. This approach transforms
conventional text-to-image generation and editing into a reasoning-guided
framework that analyzes semantic relationships and spatial arrangements. We
define the formulation of GoT and construct large-scale GoT datasets containing
over 9M samples with detailed reasoning chains capturing semantic-spatial
relationships. To leverage the advantages of GoT, we implement a unified
framework that integrates Qwen2.5-VL for reasoning chain generation with an
end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance
Module. Experiments show our GoT framework achieves excellent performance on
both generation and editing tasks, with significant improvements over
baselines. Additionally, our approach enables interactive visual generation,
allowing users to explicitly modify reasoning steps for precise image
adjustments. GoT pioneers a new direction for reasoning-driven visual
generation and editing, producing images that better align with human intent.
To facilitate future research, we make our datasets, code, and pretrained
models publicly available at https://github.com/rongyaofang/GoT.
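
To make the idea of an editable reasoning chain more concrete, the following is a minimal sketch of how a chain carrying both semantic and spatial information could be parsed and adjusted before image generation. The text format `phrase (x1,y1),(x2,y2)`, the `GroundedPhrase` class, and the `parse_chain` and `shift_object` helpers are all hypothetical names chosen for illustration; they are assumptions, not the format or API of the released GoT code.

```python
import re
from dataclasses import dataclass, replace

# ASSUMPTION: each grounded object appears as "<phrase> (x1,y1),(x2,y2)".
# This format is illustrative only and is not taken from the GoT release.
BOX_PATTERN = re.compile(
    r"(?P<phrase>[^()]+?)\s*"
    r"\((?P<x1>\d+),\s*(?P<y1>\d+)\),\s*"
    r"\((?P<x2>\d+),\s*(?P<y2>\d+)\)"
)


@dataclass(frozen=True)
class GroundedPhrase:
    """One reasoning step: what the object is and where it should go."""
    phrase: str
    box: tuple  # (x1, y1, x2, y2) in pixel coordinates


def parse_chain(chain: str) -> list:
    """Extract every (phrase, box) pair from a reasoning-chain string."""
    items = []
    for match in BOX_PATTERN.finditer(chain):
        phrase = match.group("phrase").strip(" ,.;")
        box = tuple(int(match.group(k)) for k in ("x1", "y1", "x2", "y2"))
        items.append(GroundedPhrase(phrase, box))
    return items


def shift_object(items: list, phrase: str, dx: int, dy: int) -> list:
    """Return a new list in which the named object's box is translated."""
    shifted = []
    for item in items:
        if item.phrase == phrase:
            x1, y1, x2, y2 = item.box
            item = replace(item, box=(x1 + dx, y1 + dy, x2 + dx, y2 + dy))
        shifted.append(item)
    return shifted


if __name__ == "__main__":
    chain = "a red sofa (10,40),(60,90); a lamp (70,5),(90,50)"
    items = parse_chain(chain)
    # A user edit: move the lamp 20 pixels to the left before generation.
    for item in shift_object(items, "a lamp", dx=-20, dy=0):
        print(item.phrase, item.box)
```

In this sketch the user edit happens on the parsed structure rather than on the image itself, which mirrors the interactive use described in the abstract: the reasoning step is changed first, and the image generator would then be conditioned on the revised chain.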