Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models
September 29, 2025
作者: Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, Zhi-Ming Ma
cs.AI
Abstract
Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pre-training and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective that differs from the pretraining objective, the score/flow-matching loss. In this work, we establish a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce Advantage Weighted Matching (AWM), a policy-gradient method for diffusion models. It uses the same score/flow-matching loss as pretraining to obtain a lower-variance objective and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. This simple yet effective design brings substantial benefits: on the GenEval, OCR, and PickScore benchmarks, AWM delivers up to a 24× speedup over Flow-GRPO (which builds on DDPO) when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at https://github.com/scxue/advantage_weighted_matching.
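
To make the recipe concrete, below is a minimal PyTorch sketch of an advantage-weighted flow-matching loss in the spirit described above: the same flow-matching regression used in pretraining, with each sample's loss reweighted by its (detached) advantage. The function name `awm_loss`, the rectified-flow parameterization, and the group-normalized `advantages` input are illustrative assumptions, not the authors' released implementation; consult the linked repository for the actual code.

```python
# Minimal sketch of an advantage-weighted flow-matching loss (assumptions:
# rectified-flow parameterization x_t = (1 - t) * x0 + t * noise with target
# velocity = noise - x0; names are illustrative, not from the paper's code).
import torch

def awm_loss(model, x0, cond, advantages):
    """Flow-matching loss reweighted by per-sample advantage.

    model:      velocity predictor called as model(x_t, t, cond)
    x0:         (B, C, H, W) clean latents of the sampled images
    cond:       conditioning (e.g., prompt embeddings) used to generate them
    advantages: (B,) group-normalized rewards, detached from the graph
    """
    b = x0.shape[0]
    noise = torch.randn_like(x0)
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)

    # Same interpolation and regression target as flow-matching pretraining.
    x_t = (1.0 - t) * x0 + t * noise
    target_v = noise - x0

    pred_v = model(x_t, t.view(b), cond)

    # Per-sample regression error, then advantage reweighting: high-reward
    # samples are reinforced, low-reward samples are suppressed.
    per_sample = ((pred_v - target_v) ** 2).mean(dim=(1, 2, 3))
    return (advantages.detach() * per_sample).mean()
```

In a policy-gradient loop, the advantages would typically come from rewards of images sampled from the current model (e.g., reward minus the per-prompt group mean), so the loss pulls the model toward its own high-reward generations while keeping the objective identical in form to pretraining.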