Enhancing Spatial Understanding in Image Generation via Reward Modeling

Abstract

Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also placed higher demands on prompt complexity, particularly when prompts encode intricate spatial relationships. In such cases, achieving a satisfactory result often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of existing image generation models. We first construct SpatialReward-Dataset, which contains over 80k preference pairs. Building on this dataset, we develop SpatialScore, a reward model that evaluates the accuracy of spatial relationships in text-to-image generation and that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.
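The abstract does not state which objective is used to train the reward model on preference pairs. A common choice for this kind of data is the Bradley-Terry pairwise loss, and the sketch below illustrates it under that assumption only. The function names and the scalar scores are hypothetical and are not taken from the paper.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    Assumption: the reward model assigns one scalar score to each
    (prompt, image) pair, and the preferred image should score higher.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) computed as softplus(-margin); the two branches
    # keep the argument of exp() non-positive to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of (preferred, rejected) score pairs ranked correctly."""
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)
```

The loss shrinks as the preferred image's score rises above the rejected one, and pairwise accuracy is one simple way to compare a reward model against other evaluators on held-out preference pairs.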
