Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models

Abstract

Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pretraining and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective that differs from the pretraining objective, the score/flow matching loss. In this work, we present a new theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce Advantage Weighted Matching (AWM), a policy-gradient method for diffusion models. AWM uses the same score/flow matching loss as pretraining, which gives a lower-variance objective, and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL both conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. The benefits of this simple design are substantial: on the GenEval, OCR, and PickScore benchmarks, AWM delivers up to a 24× speedup over Flow-GRPO (which builds on DDPO) when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at https://github.com/scxue/advantage_weighted_matching.
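
The reweighting idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation (see the linked repository for that); it only shows what "the pretraining matching loss, weighted per sample by its advantage" might look like. Several details are assumptions that the abstract does not state: the linear interpolation path x_t = (1 - t) * x0 + t * noise with target velocity noise - x0, the group-normalized advantage (reward minus group mean, divided by group standard deviation), and every function name below. Samples are flat lists of floats so that the example runs with the standard library alone; a real setup would use a tensor library.

```python
import random
import statistics


def group_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of samples (an assumption)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def flow_matching_loss(velocity_fn, x0, noise, t):
    """Squared-error flow matching loss for one sample (flat list of floats).

    Assumes the linear path x_t = (1 - t) * x0 + t * noise, whose target
    velocity is noise - x0.
    """
    x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, noise)]
    target = [e - a for a, e in zip(x0, noise)]
    pred = velocity_fn(x_t, t)
    return statistics.fmean([(p - v) ** 2 for p, v in zip(pred, target)])


def advantage_weighted_matching_loss(velocity_fn, samples, rewards, rng):
    """Mean of advantage * matching loss over samples drawn from the model.

    Minimizing this lowers the matching loss on above-average samples and
    raises it on below-average ones.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for x0, adv in zip(samples, advantages):
        noise = [rng.gauss(0.0, 1.0) for _ in x0]  # fresh noise per sample
        t = rng.random()                           # random time in [0, 1)
        total += adv * flow_matching_loss(velocity_fn, x0, noise, t)
    return total / len(samples)


def zero_velocity(x_t, t):
    """Hypothetical stand-in for a model: always predicts zero velocity."""
    return [0.0] * len(x_t)


# Toy usage with made-up samples and rewards.
rng = random.Random(0)
samples = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
loss = advantage_weighted_matching_loss(
    zero_velocity, samples, [0.1, 0.4, 0.7, 0.9], rng
)
```

With all advantages set to a positive constant, this reduces to an ordinary flow matching loss, which is the sense in which the modeling objective stays the same as in pretraining.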
