Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models

Abstract

Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pretraining and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective that differs from the score/flow-matching loss used in pretraining. In this work, we establish a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce Advantage Weighted Matching (AWM), a policy-gradient method for diffusion models. AWM uses the same score/flow-matching loss as pretraining, which gives a lower-variance objective, and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL both conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. This simple yet effective design yields substantial benefits: on the GenEval, OCR, and PickScore benchmarks, AWM delivers up to a 24× speedup over Flow-GRPO (which builds on DDPO) when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at https://github.com/scxue/advantage_weighted_matching.
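
The idea of reweighting the pretraining loss by an advantage can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes a rectified-flow style interpolation x_t = (1 - t) * x0 + t * noise with velocity target noise - x0, and it assumes advantages are obtained by standardizing rewards within a group of samples for the same prompt (as in GRPO-style methods). The names `group_advantages`, `flow_matching_loss`, `awm_loss`, and `velocity_fn` are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (assumed GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def flow_matching_loss(velocity_fn, x0, noise, t):
    """Per-sample flow-matching loss on the assumed linear path.

    x_t = (1 - t) * x0 + t * noise, and the regression target is noise - x0.
    """
    x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, noise)]
    target = [e - a for a, e in zip(x0, noise)]
    pred = velocity_fn(x_t, t)
    return mean([(p - g) ** 2 for p, g in zip(pred, target)])


def awm_loss(velocity_fn, samples, rewards, rng):
    """Advantage-weighted average of per-sample flow-matching losses.

    `samples` are clean outputs drawn from the current model; each one is
    re-noised at a random time t, exactly as in pretraining, and its loss
    is scaled by its advantage.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for x0, adv in zip(samples, advantages):
        t = rng.random()
        noise = [rng.gauss(0.0, 1.0) for _ in x0]
        total += adv * flow_matching_loss(velocity_fn, x0, noise, t)
    return total / len(samples)
```

In this sketch, a sample with positive advantage contributes a positively weighted matching loss (the model is pulled toward it), while a sample with negative advantage contributes a negatively weighted one (the model is pushed away). A real training loop would compute this with an autodiff framework and typically add regularization toward a reference model; those parts are omitted here.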
