**From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space**

Abstract

Group Relative Policy Optimization (GRPO) has emerged as a powerful framework for preference alignment in text-to-image (T2I) flow models. However, we observe that the standard paradigm, which evaluates a group of generated samples against a single condition, suffers from insufficient exploration of inter-sample relationships, constraining both alignment efficacy and performance ceilings. To address this sparse single-view evaluation scheme, we propose Multi-View GRPO (MV-GRPO), a novel approach that enhances relationship exploration by augmenting the condition space to create a dense multi-view reward mapping. Specifically, for a group of samples generated from one prompt, MV-GRPO leverages a flexible Condition Enhancer to generate semantically adjacent yet diverse captions. These captions enable multi-view advantage re-estimation, capturing diverse semantic attributes and providing richer optimization signals. By deriving the probability distribution of the original samples conditioned on these new captions, we can incorporate them into the training process without costly sample regeneration. Extensive experiments demonstrate that MV-GRPO achieves superior alignment performance over state-of-the-art methods.
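To make the contrast between sparse single-view and dense multi-view evaluation concrete, the following is a minimal sketch of group-relative advantage estimation over a reward matrix. It is not the paper's implementation: the function name `group_relative_advantages`, the per-view standardization, and the toy reward values are all assumptions made for illustration, and the abstract does not specify how MV-GRPO scores or aggregates views.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards across the sample group, separately for each view.

    `rewards` has samples along the last axis. A 1-D input is a single view;
    a 2-D input of shape (num_views, group_size) is treated as one row per
    condition (assumed layout, not taken from the paper).
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Toy reward matrix with made-up numbers: rows are conditions, columns are
# the same four samples generated once from the original prompt.
rewards = np.array([
    [0.2, 0.5, 0.8, 0.5],  # original prompt
    [0.9, 0.1, 0.5, 0.5],  # hypothetical augmented caption 1
    [0.4, 0.4, 0.6, 0.6],  # hypothetical augmented caption 2
])

# Sparse single-view signal: one advantage per sample.
single_view = group_relative_advantages(rewards[0])

# Dense multi-view signal: one advantage per (caption, sample) pair,
# computed without generating any new samples.
multi_view = group_relative_advantages(rewards)
```

In this toy data, the first sample ranks lowest under the original prompt but highest under the first hypothetical caption, which is the kind of inter-sample relationship a single condition cannot expose.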
