GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Abstract

Visual generation models have made remarkable progress in creating realistic images from text prompts, yet they still struggle with complex prompts that specify multiple objects along with their precise spatial relationships and attributes. Handling such prompts effectively requires explicit reasoning about both semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 uses carefully designed reinforcement learning to let models autonomously discover effective reasoning strategies that go beyond predefined templates. To achieve this, we propose a dual-stage, multi-dimensional reward framework that leverages multimodal large language models (MLLMs) to evaluate both the reasoning process and the final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified manner. Experimental results demonstrate significant improvements on the T2I-CompBench benchmark, particularly on compositional tasks involving precise spatial relationships and attribute binding. By successfully transferring sophisticated reasoning capabilities to the visual generation domain, GoT-R1 advances the state of the art in image generation. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.
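To make the idea of a dual-stage, multi-dimensional reward concrete, the following is a minimal sketch of how per-dimension scores for the reasoning process and for the final image might be folded into one scalar reward. It is not taken from the GoT-R1 code: the function names, the dictionary keys, the weights, and the `stage_mix` parameter are all hypothetical, and the scores are assumed to have been produced beforehand by an MLLM judge and normalized to [0, 1].

```python
from typing import Mapping


def weighted_mean(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average over the dimensions present in `scores`."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights for the scored dimensions must sum to a positive value")
    return sum(weights[name] * value for name, value in scores.items()) / total_weight


def total_reward(
    reasoning_scores: Mapping[str, float],
    output_scores: Mapping[str, float],
    weights: Mapping[str, float],
    stage_mix: float = 0.5,
) -> float:
    """Combine reasoning-stage and output-stage scores into one scalar.

    Hypothetical aggregation: `stage_mix` is the share given to the
    reasoning stage, and the remainder goes to the final image.
    """
    for value in (*reasoning_scores.values(), *output_scores.values()):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores are assumed to be normalized to [0, 1]")
    if not 0.0 <= stage_mix <= 1.0:
        raise ValueError("stage_mix must lie in [0, 1]")
    reasoning = weighted_mean(reasoning_scores, weights)
    output = weighted_mean(output_scores, weights)
    return stage_mix * reasoning + (1.0 - stage_mix) * output


# Illustrative values only; these are not results from the paper.
weights = {"semantic": 1.0, "spatial": 1.0, "quality": 1.0}
reasoning_scores = {"semantic": 0.8, "spatial": 0.6}  # judged on the reasoning chain
output_scores = {"semantic": 0.9, "spatial": 0.5, "quality": 0.7}  # judged on the image
reward = total_reward(reasoning_scores, output_scores, weights)
```

A linear combination is only one possible design. It keeps each dimension's contribution easy to inspect, but the abstract does not state how the actual framework aggregates its reward terms.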