Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models
September 29, 2025
作者: Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, Zhi-Ming Ma
cs.AI
Abstract
Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pre-training and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective that differs from the pretraining objective, the score/flow-matching loss. In this work, we establish a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce Advantage Weighted Matching (AWM), a policy-gradient method for diffusion models. It uses the same score/flow-matching loss as pretraining to obtain a lower-variance objective and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. This simple yet effective design brings substantial benefits: on the GenEval, OCR, and PickScore benchmarks, AWM delivers up to a 24× speedup over Flow-GRPO (which builds on DDPO) when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at https://github.com/scxue/advantage_weighted_matching.
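
To make the recipe concrete, below is a minimal PyTorch sketch of an advantage-weighted flow-matching loss in the spirit described above: the same flow-matching regression used in pretraining, with each sample's loss reweighted by its (detached) advantage. The function name `awm_loss`, the rectified-flow parameterization, and the group-normalized `advantages` input are illustrative assumptions, not the authors' released implementation; consult the linked repository for the actual code.

```python
# Minimal sketch of an advantage-weighted flow-matching loss (assumptions:
# rectified-flow parameterization x_t = (1 - t) * x0 + t * noise with target
# velocity = noise - x0; names are illustrative, not from the paper's code).
import torch

def awm_loss(model, x0, cond, advantages):
    """Flow-matching loss reweighted by per-sample advantage.

    model:      velocity predictor called as model(x_t, t, cond)
    x0:         (B, C, H, W) clean latents of the sampled images
    cond:       conditioning (e.g., prompt embeddings) used to generate them
    advantages: (B,) group-normalized rewards, detached from the graph
    """
    b = x0.shape[0]
    noise = torch.randn_like(x0)
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)

    # Same interpolation and regression target as flow-matching pretraining.
    x_t = (1.0 - t) * x0 + t * noise
    target_v = noise - x0

    pred_v = model(x_t, t.view(b), cond)

    # Per-sample regression error, then advantage reweighting: high-reward
    # samples are reinforced, low-reward samples are suppressed.
    per_sample = ((pred_v - target_v) ** 2).mean(dim=(1, 2, 3))
    return (advantages.detach() * per_sample).mean()
```

In a policy-gradient loop, the advantages would typically come from rewards of images sampled from the current model (e.g., reward minus the per-prompt group mean), so the loss pulls the model toward its own high-reward generations while keeping the objective identical in form to pretraining.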