Enhancing Spatial Understanding in Image Generation via Reward Modeling

Abstract

Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also placed higher demands on handling complex prompts, particularly those that encode intricate spatial relationships. In such cases, achieving a satisfactory result often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset, which contains over 80k preference pairs. Building on this dataset, we develop SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation; it surpasses even leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.
