Enhancing Spatial Understanding in Image Generation via Reward Modeling

Abstract

Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also placed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset, which contains over 80k preference pairs. Building on this dataset, we develop SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation; it even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.