Flow-GRPO: Training Flow Matching Models via Online RL
May 8, 2025
Authors: Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang
cs.AI
Abstract
We propose Flow-GRPO, the first method integrating online reinforcement
learning (RL) into flow matching models. Our approach uses two key strategies:
(1) an ODE-to-SDE conversion that transforms a deterministic Ordinary
Differential Equation (ODE) into an equivalent Stochastic Differential Equation
(SDE) that matches the original model's marginal distribution at all timesteps,
enabling statistical sampling for RL exploration; and (2) a Denoising Reduction
strategy that reduces training denoising steps while retaining the original
inference timestep number, significantly improving sampling efficiency without
performance degradation. Empirically, Flow-GRPO is effective across multiple
text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly
perfect object counts, spatial relations, and fine-grained attributes, boosting
GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy
improves from 59% to 92%, significantly enhancing text generation.
Flow-GRPO also achieves substantial gains in human preference alignment.
Notably, little to no reward hacking occurred, meaning rewards did not increase
at the cost of image quality or diversity, and both remained stable in our
experiments.
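For readers who want the mechanics behind strategy (1), the sketch below spells out the standard marginal-preserving ODE-to-SDE conversion under an SD3-style rectified-flow interpolation x_t = (1 - t) x_0 + t ε. The noise scale σ_t is left as a free parameter; the paper's exact schedule and discretization may differ.

```latex
% Deterministic sampler: probability-flow ODE, integrated from t = 1 (noise) to t = 0 (data)
\mathrm{d}x_t = v_\theta(x_t, t)\,\mathrm{d}t

% Marginal-preserving SDE family: for any \sigma_t \ge 0, the reverse-time SDE
\mathrm{d}x_t = \Big[\, v_\theta(x_t, t) - \tfrac{\sigma_t^2}{2}\,\nabla_x \log p_t(x_t) \Big]\,\mathrm{d}t
              + \sigma_t\,\mathrm{d}\bar{W}_t

% shares every marginal p_t with the ODE (a Fokker--Planck argument), so sampling
% becomes stochastic without changing the distribution the model generates.
% Under x_t = (1 - t) x_0 + t\,\varepsilon, the score follows from the learned velocity:
\nabla_x \log p_t(x_t) = -\,\frac{x_t + (1 - t)\, v_\theta(x_t, t)}{t}
```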
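A minimal PyTorch-style sketch of how the two strategies could fit together is given below. It is illustrative only: `flow_model`, the latent shape, the `sigma` value, and the step counts are assumptions, not the released Flow-GRPO implementation. The point is that `sigma > 0` supplies the stochasticity GRPO needs for exploration, and that RL rollouts can use a short schedule (Denoising Reduction) while inference keeps the full one.

```python
# Illustrative sketch only: names, shapes, and schedules are assumptions,
# not the released Flow-GRPO code.
import torch

TRAIN_STEPS = 10   # coarse schedule used when collecting RL rollouts
INFER_STEPS = 40   # original fine-grained schedule kept at inference time


def timesteps(num_steps: int) -> torch.Tensor:
    """Uniform time grid from t = 1 (noise) down to t = 0 (data)."""
    return torch.linspace(1.0, 0.0, num_steps + 1)


@torch.no_grad()
def sample_sde(flow_model, prompt, num_steps: int, sigma: float = 0.7):
    """Euler-Maruyama sampling of the marginal-preserving SDE.

    sigma > 0 injects the randomness online RL needs for exploration;
    sigma = 0 recovers the deterministic ODE sampler.
    """
    ts = timesteps(num_steps)
    x = torch.randn(1, 16, 64, 64)               # latent shape is illustrative
    trajectory = [x]
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t_cur                       # negative: noise -> data
        v = flow_model(x, t_cur, prompt)          # predicted velocity
        score = -(x + (1.0 - t_cur) * v) / t_cur.clamp(min=1e-4)
        drift = v - 0.5 * sigma**2 * score
        x = x + drift * dt + sigma * torch.sqrt(-dt) * torch.randn_like(x)
        trajectory.append(x)
    return trajectory


def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: rewards normalized within a group of samples
    generated for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Denoising Reduction: rollouts for policy updates would call
# sample_sde(..., num_steps=TRAIN_STEPS) to cut sampling cost, while final
# images are produced with num_steps=INFER_STEPS.
```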