GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
May 22, 2025
Authors: Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu
cs.AI
Abstract
Visual generation models have made remarkable progress in creating realistic
images from text prompts, yet struggle with complex prompts that specify
multiple objects with precise spatial relationships and attributes. Effective
handling of such prompts requires explicit reasoning about the semantic content
and spatial layout. We present GoT-R1, a framework that applies reinforcement
learning to enhance semantic-spatial reasoning in visual generation. Building
upon the Generation Chain-of-Thought approach, GoT-R1 enables models to
autonomously discover effective reasoning strategies beyond predefined
templates through carefully designed reinforcement learning. To achieve this,
we propose a dual-stage multi-dimensional reward framework that leverages MLLMs
to evaluate both the reasoning process and final output, enabling effective
supervision across the entire generation pipeline. The reward system assesses
semantic alignment, spatial accuracy, and visual quality in a unified approach.
Experimental results demonstrate significant improvements on the T2I-CompBench
benchmark, particularly in compositional tasks involving precise spatial
relationships and attribute binding. GoT-R1 advances the state-of-the-art in
image generation by successfully transferring sophisticated reasoning
capabilities to the visual generation domain. To facilitate future research, we
make our code and pretrained models publicly available at
https://github.com/gogoduan/GoT-R1.
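The abstract describes a dual-stage, multi-dimensional reward that scores both the reasoning process and the final image along semantic alignment, spatial accuracy, and visual quality, but does not spell out how those scores are combined. The sketch below is a minimal, hypothetical illustration of one way MLLM-judged scores could be aggregated into a single scalar reward for reinforcement learning; the names `RewardScores` and `combined_reward`, the equal per-dimension weighting, and the stage weighting `w_reasoning` are assumptions for illustration, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class RewardScores:
    """Per-dimension scores in [0, 1], e.g. as judged by an MLLM evaluator."""
    semantic_alignment: float   # does the output match the prompt's semantics?
    spatial_accuracy: float     # are object positions/relations correct?
    visual_quality: float       # overall fidelity of the output


def combined_reward(
    reasoning: RewardScores,   # stage 1: scores for the reasoning chain / layout plan
    image: RewardScores,       # stage 2: scores for the generated image
    w_reasoning: float = 0.5,  # hypothetical weighting between the two stages
) -> float:
    """Collapse dual-stage, multi-dimensional scores into one scalar RL reward.

    Illustrative aggregation only (simple weighted average of per-dimension
    means); the actual GoT-R1 reward composition may differ.
    """
    def mean(s: RewardScores) -> float:
        return (s.semantic_alignment + s.spatial_accuracy + s.visual_quality) / 3.0

    return w_reasoning * mean(reasoning) + (1.0 - w_reasoning) * mean(image)


# Example: scalar reward for one sampled generation during policy optimization.
r = combined_reward(
    reasoning=RewardScores(0.9, 0.8, 1.0),
    image=RewardScores(0.85, 0.7, 0.9),
)
print(f"scalar reward: {r:.3f}")
```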