Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models
September 29, 2025
Authors: Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, Zhi-Ming Ma
cs.AI
Abstract
Reinforcement Learning (RL) has emerged as a central paradigm for advancing
Large Language Models (LLMs), where pre-training and RL post-training share the
same log-likelihood formulation. In contrast, recent RL approaches for
diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO),
optimize an objective different from the pretraining objective, the
score/flow-matching loss. In this work, we present a novel theoretical analysis: DDPO is
an implicit form of score/flow matching with noisy targets, which increases
variance and slows convergence. Building on this analysis, we introduce
Advantage Weighted Matching (AWM), a policy-gradient method for
diffusion models. It uses the same score/flow-matching loss as pretraining to obtain a
lower-variance objective and reweights each sample by its advantage. In effect,
AWM raises the influence of high-reward samples and suppresses low-reward ones
while keeping the modeling objective identical to pretraining. This unifies
pretraining and RL conceptually and practically, is consistent with
policy-gradient theory, reduces variance, and yields faster convergence. This
simple yet effective design yields substantial benefits: on GenEval, OCR, and
PickScore benchmarks, AWM delivers up to a 24× speedup over Flow-GRPO
(which builds on DDPO) when applied to Stable Diffusion 3.5 Medium and FLUX,
without compromising generation quality. Code is available at
https://github.com/scxue/advantage_weighted_matching.
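
To make the reweighting concrete, below is a minimal sketch of an advantage-weighted
flow-matching loss, assuming a rectified-flow velocity model with a
model(x_t, t, cond) interface and GRPO-style group-normalized advantages; the
function and argument names are illustrative and not taken from the released code.

```python
import torch

def awm_loss(model, x1, cond, rewards):
    """Advantage-weighted flow-matching loss on a batch of model-generated samples.

    x1:      generated samples (e.g. latents), shape (B, ...)
    cond:    conditioning (e.g. prompt embeddings) used to generate x1
    rewards: scalar reward per sample, shape (B,)
    """
    # Group-normalized advantages (GRPO-style): high-reward samples get positive
    # weight, low-reward samples negative weight.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Same rectified-flow / flow-matching construction as pretraining:
    # interpolate between noise x0 and the sample x1, regress the velocity x1 - x0.
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], device=x1.device)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))   # broadcast t over remaining dims
    x_t = (1.0 - t_) * x0 + t_ * x1
    target = x1 - x0

    v_pred = model(x_t, t, cond)               # assumed velocity-prediction interface
    per_sample = ((v_pred - target) ** 2).flatten(1).mean(dim=1)

    # The only departure from pretraining: reweight each sample's loss by its advantage.
    return (adv * per_sample).mean()
```

Note that the per-sample term is exactly the pretraining flow-matching loss; the
advantage weighting is the only change, which is the alignment between RL and
pretraining that the abstract describes.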