Flow-GRPO: Training Flow Matching Models via Online RL

May 8, 2025
作者: Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang
cs.AI

Abstract

We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy improves from 59% to 92%, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, little to no reward hacking occurred, meaning rewards did not increase at the cost of image quality or diversity, and both remained stable in our experiments.
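To make the two strategies more concrete, below is a minimal PyTorch-style sketch, not the paper's implementation. It assumes a rectified-flow interpolation x_t = (1 - t) * x_0 + t * eps, under which the score can be recovered from the predicted velocity, and it treats the noise schedule sigma and all names (velocity_model, sde_sample_step, group_relative_advantages) as illustrative placeholders. It shows a stochastic sampling step whose Gaussian transition yields a per-step log-probability for RL, and the group-relative advantage normalization used by GRPO-style training.

```python
import torch
from torch.distributions import Normal


def sde_sample_step(velocity_model, x, t, dt, sigma):
    """One stochastic step with (ideally) the same marginals as the Euler ODE step.

    Assumes x_t = (1 - t) * x_0 + t * eps, so the score is recoverable from the
    velocity as score(x, t) = -(x + (1 - t) * v) / t.  The noise level sigma is a
    free schedule here; the paper derives its own exact SDE and schedule.
    Requires t > 0, so the sampling loop should stop at a small t_min.
    """
    v = velocity_model(x, t)                    # predicted velocity at (x, t)
    score = -(x + (1.0 - t) * v) / t            # score recovered from the velocity
    drift = v - 0.5 * sigma ** 2 * score        # drift correction compensates the injected noise
    mean = x - dt * drift                       # integrate from t toward t - dt
    std = sigma * dt ** 0.5
    x_next = mean + std * torch.randn_like(x)
    # Per-sample log-probability of the realized Gaussian transition,
    # which is what the RL objective needs for each denoising step.
    log_prob = Normal(mean, std).log_prob(x_next).flatten(1).sum(dim=1)
    return x_next, log_prob


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within each group of samples
    generated from the same prompt.  rewards: (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

In this reading, Denoising Reduction simply means running the stochastic sampling loop with far fewer steps when collecting RL training samples than at inference time; the specific step counts and the sketch above are illustrative assumptions, not the paper's exact settings.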
