From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space

Abstract

Group Relative Policy Optimization (GRPO) has emerged as a powerful framework for preference alignment in text-to-image (T2I) flow models. However, we observe that the standard paradigm, which evaluates a group of generated samples against a single condition, explores inter-sample relationships insufficiently, constraining both alignment efficacy and the performance ceiling. To address this sparse single-view evaluation scheme, we propose Multi-View GRPO (MV-GRPO), a novel approach that enhances relationship exploration by augmenting the condition space to create a dense multi-view reward mapping. Specifically, for a group of samples generated from one prompt, MV-GRPO uses a flexible Condition Enhancer to generate semantically adjacent yet diverse captions. These captions enable multi-view advantage re-estimation, which captures diverse semantic attributes and provides richer optimization signals. By deriving the probability distribution of the original samples conditioned on these new captions, we can incorporate the captions into the training process without costly sample regeneration. Extensive experiments demonstrate that MV-GRPO achieves superior alignment performance over state-of-the-art methods.
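
The abstract gives no formulas, so the following is only a minimal sketch of what multi-view advantage re-estimation could look like. It assumes the usual GRPO normalization (reward minus group mean, divided by group standard deviation) is applied separately under each condition. The function names, the reward values, and the layout of `reward_matrix` are hypothetical and not taken from the paper. How the per-view advantages are combined during training is not stated in the abstract and is left out here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Assumed GRPO-style normalization: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def multi_view_advantages(reward_matrix, eps=1e-8):
    """Normalize each view (row) independently.

    reward_matrix[k][i] is the reward of sample i scored against condition k.
    Row 0 stands for the original prompt; the other rows stand for
    hypothetical augmented captions. The samples themselves are the same
    in every row, so nothing is regenerated.
    """
    return [group_relative_advantages(row, eps) for row in reward_matrix]


# Made-up rewards: 2 conditions x 4 samples.
rewards = [
    [0.2, 0.4, 0.6, 0.8],  # original prompt
    [0.9, 0.1, 0.5, 0.5],  # hypothetical augmented caption
]
adv = multi_view_advantages(rewards)
```

In this toy case the last sample ranks highest under the original prompt while the first sample ranks highest under the augmented caption, which illustrates how one fixed group of samples can yield a different ranking signal under each view.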
