Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models
September 29, 2025
Authors: Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, Zhi-Ming Ma
cs.AI
Abstract
Reinforcement Learning (RL) has emerged as a central paradigm for advancing
Large Language Models (LLMs), where pre-training and RL post-training share the
same log-likelihood formulation. In contrast, recent RL approaches for
diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO),
optimize an objective different from the pretraining objective, the
score/flow-matching loss. In this work, we present a novel theoretical analysis: DDPO is
an implicit form of score/flow matching with noisy targets, which increases
variance and slows convergence. Building on this analysis, we introduce
Advantage Weighted Matching (AWM), a policy-gradient method for
diffusion models. It uses the same score/flow-matching loss as pretraining to obtain a
lower-variance objective and reweights each sample by its advantage. In effect,
AWM raises the influence of high-reward samples and suppresses low-reward ones
while keeping the modeling objective identical to pretraining. This unifies
pretraining and RL conceptually and practically, is consistent with
policy-gradient theory, reduces variance, and yields faster convergence. This
simple yet effective design yields substantial benefits: on GenEval, OCR, and
PickScore benchmarks, AWM delivers up to a 24× speedup over Flow-GRPO
(which builds on DDPO) when applied to Stable Diffusion 3.5 Medium and FLUX,
without compromising generation quality. Code is available at
https://github.com/scxue/advantage_weighted_matching.
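
To make the reweighting concrete, below is a minimal sketch of an advantage-weighted
flow-matching loss, assuming a rectified-flow velocity model with a
model(x_t, t, cond) interface and GRPO-style group-normalized advantages; the
function and argument names are illustrative and not taken from the released code.

```python
import torch

def awm_loss(model, x1, cond, rewards):
    """Advantage-weighted flow-matching loss on a batch of model-generated samples.

    x1:      generated samples (e.g. latents), shape (B, ...)
    cond:    conditioning (e.g. prompt embeddings) used to generate x1
    rewards: scalar reward per sample, shape (B,)
    """
    # Group-normalized advantages (GRPO-style): high-reward samples get positive
    # weight, low-reward samples negative weight.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Same rectified-flow / flow-matching construction as pretraining:
    # interpolate between noise x0 and the sample x1, regress the velocity x1 - x0.
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], device=x1.device)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))   # broadcast t over remaining dims
    x_t = (1.0 - t_) * x0 + t_ * x1
    target = x1 - x0

    v_pred = model(x_t, t, cond)               # assumed velocity-prediction interface
    per_sample = ((v_pred - target) ** 2).flatten(1).mean(dim=1)

    # The only departure from pretraining: reweight each sample's loss by its advantage.
    return (adv * per_sample).mean()
```

Note that the per-sample term is exactly the pretraining flow-matching loss; the
advantage weighting is the only change, which is the alignment between RL and
pretraining that the abstract describes.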