Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Abstract

Recent work has shown that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov decision process and apply PPO-style ratio clipping to enforce a trust region. We argue, however, that ratio clipping is structurally ill-suited to flow models: the probability ratio between the new and old policies is a noisy, single-sample estimate of the true policy divergence, so it over-constrains some regions of the trajectory and under-constrains others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. The key observation is that the per-step policy in flow models is Gaussian, so the KL divergence between the old and new policies can be computed exactly and cheaply. Flow-DPPO employs an asymmetric divergence mask that blocks a gradient update only when it both moves away from the trust region and violates the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training in settings where ratio clipping degrades. Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.
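The two mechanisms named in the abstract, an exact Gaussian KL and an asymmetric divergence mask, can be illustrated with a minimal NumPy sketch. It is not taken from the released code and rests on several assumptions: the old and new per-step policies are isotropic Gaussians that share the same standard deviation `sigma` (so the KL reduces to a scaled squared distance between means), "moving away from the trust region" is read as the update pushing the probability ratio further from 1 in the direction favored by the advantage, and the names `gaussian_kl`, `asymmetric_divergence_mask`, and `kl_threshold` are hypothetical.

```python
import numpy as np


def gaussian_kl(mu_old, mu_new, sigma):
    """Exact KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)) per sample.

    Assumes both policies share the same isotropic std `sigma`, in which
    case the KL is ||mu_new - mu_old||^2 / (2 * sigma^2).
    """
    diff = (mu_new - mu_old).reshape(mu_old.shape[0], -1)
    return 0.5 * np.sum(diff ** 2, axis=1) / sigma ** 2


def asymmetric_divergence_mask(ratio, advantage, kl, kl_threshold):
    """Return 1.0 where the gradient is kept and 0.0 where it is blocked.

    An update is blocked only if BOTH conditions hold (assumed reading):
      1. it moves away from the old policy: the ratio is already above 1
         with a positive advantage, or below 1 with a negative advantage;
      2. the exact KL exceeds `kl_threshold`.
    """
    moving_away = ((advantage > 0) & (ratio > 1.0)) | (
        (advantage < 0) & (ratio < 1.0)
    )
    violates = kl > kl_threshold
    return np.where(moving_away & violates, 0.0, 1.0)


def masked_surrogate(ratio, advantage, mask):
    """Value of the masked policy-gradient surrogate (to be maximized)."""
    return np.mean(ratio * advantage * mask)
```

Under this reading, a sample whose KL exceeds the threshold still contributes a gradient when that gradient pulls the new policy back toward the old one, which is what makes the mask asymmetric. In an autodiff framework the mask would be treated as a constant (no gradient flows through it).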