From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space
March 13, 2026
作者: Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tianyi Wei, Xiaohang Zhan, Jiaqi Wang, Tong Wu, Xingang Pan, Dahua Lin
cs.AI
Abstract
Group Relative Policy Optimization (GRPO) has emerged as a powerful framework for preference alignment in text-to-image (T2I) flow models. However, we observe that the standard paradigm, in which a group of generated samples is evaluated against a single condition, suffers from insufficient exploration of inter-sample relationships, constraining both alignment efficacy and the performance ceiling. To address this sparse single-view evaluation scheme, we propose Multi-View GRPO (MV-GRPO), a novel approach that enhances relationship exploration by augmenting the condition space to create a dense multi-view reward mapping. Specifically, for a group of samples generated from one prompt, MV-GRPO leverages a flexible Condition Enhancer to generate semantically adjacent yet diverse captions. These captions enable multi-view advantage re-estimation, capturing diverse semantic attributes and providing richer optimization signals. By deriving the probability distribution of the original samples conditioned on these new captions, we can incorporate them into the training process without costly sample regeneration. Extensive experiments demonstrate that MV-GRPO achieves superior alignment performance over state-of-the-art methods.
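To make the multi-view advantage re-estimation concrete, the following is a minimal sketch (not the authors' implementation) of how per-group GRPO advantages could be computed under multiple augmented captions: each caption "view" scores all samples in the group, advantages are group-normalized per view as in standard GRPO, and the per-view advantages are then aggregated. The function name, the reward-matrix layout, and the mean aggregation are all assumptions for illustration.

```python
import numpy as np

def multi_view_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical sketch of multi-view advantage re-estimation.

    rewards: array of shape (n_views, n_samples), where rewards[v, i] is the
             reward of sample i evaluated against augmented caption v.
    Returns: array of shape (n_samples,) with aggregated advantages.
    """
    # Standard GRPO group normalization, applied independently per caption view.
    mu = rewards.mean(axis=1, keepdims=True)
    sigma = rewards.std(axis=1, keepdims=True)
    per_view_adv = (rewards - mu) / (sigma + eps)
    # Aggregate views into one advantage per sample (mean is an assumption;
    # any pooling over views would fit the same interface).
    return per_view_adv.mean(axis=0)
```

Because each view is normalized within the group before aggregation, a sample that ranks highly under several semantically adjacent captions receives a consistently positive advantage, which is the "denser" optimization signal the abstract describes.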