From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space
March 13, 2026
Authors: Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tianyi Wei, Xiaohang Zhan, Jiaqi Wang, Tong Wu, Xingang Pan, Dahua Lin
cs.AI
Abstract
Group Relative Policy Optimization (GRPO) has emerged as a powerful framework for preference alignment in text-to-image (T2I) flow models. However, we observe that the standard paradigm of evaluating a group of generated samples against a single condition suffers from insufficient exploration of inter-sample relationships, constraining both alignment efficacy and the performance ceiling. To move beyond this sparse single-view evaluation scheme, we propose Multi-View GRPO (MV-GRPO), a novel approach that enhances relationship exploration by augmenting the condition space to create a dense multi-view reward mapping. Specifically, for a group of samples generated from one prompt, MV-GRPO leverages a flexible Condition Enhancer to generate semantically adjacent yet diverse captions. These captions enable multi-view advantage re-estimation, capturing diverse semantic attributes and providing richer optimization signals. By deriving the probability distribution of the original samples conditioned on these new captions, we incorporate them into training without costly sample regeneration. Extensive experiments demonstrate that MV-GRPO achieves superior alignment performance over state-of-the-art methods.
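The abstract does not include code, but the multi-view advantage re-estimation it describes can be sketched concretely. The snippet below is a minimal illustration, not the authors' implementation: it assumes each sample in the group has already been scored against the original prompt and against each augmented caption, and it assumes a simple per-view GRPO normalization followed by a mean over views (the aggregation rule is not specified in the abstract, so that choice is hypothetical).

```python
import numpy as np

def multi_view_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages aggregated over caption views.

    rewards: array of shape (num_views, group_size), where
    rewards[v, i] is the reward of sample i scored against
    augmented caption v (view 0 can be the original prompt).
    """
    # Standard GRPO normalization, applied independently per view:
    # subtract the group mean and divide by the group std.
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8
    per_view_adv = (rewards - mean) / std      # (num_views, group_size)
    # Dense multi-view signal: average the per-view advantages so each
    # sample's update reflects all semantically adjacent captions.
    return per_view_adv.mean(axis=0)           # (group_size,)
```

Under this reading, the single-view GRPO baseline is the special case num_views = 1; the extra views densify the reward mapping without regenerating samples, since only the samples' log-probabilities under the new captions are needed for the policy update.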