Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models

Abstract

Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pretraining and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective that differs from the pretraining objective, namely the score/flow-matching loss. In this work, we present a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce Advantage Weighted Matching (AWM), a policy-gradient method for diffusion. It uses the same score/flow-matching loss as pretraining to obtain a lower-variance objective and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. This simple yet effective design yields substantial benefits: on the GenEval, OCR, and PickScore benchmarks, AWM delivers up to a 24× speedup over Flow-GRPO (which builds on DDPO) when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at https://github.com/scxue/advantage_weighted_matching.
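
The idea of reweighting the pretraining loss by a per-sample advantage can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a rectified-flow convention (x_t = (1 - t) * x0 + t * noise, target velocity noise - x0) and group-normalized advantages in the style of GRPO, and every function name here is hypothetical. Any regularization or clipping the actual method may use is omitted.

```python
import math
import random


def group_normalized_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of samples.

    Assumption: GRPO-style normalization; the source only says "advantage".
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def flow_matching_error(predict_velocity, x0, noise, t):
    """Mean squared error of the velocity prediction at time t.

    Assumed convention: x_t = (1 - t) * x0 + t * noise, target = noise - x0.
    """
    x_t = [(1 - t) * a + t * e for a, e in zip(x0, noise)]
    target = [e - a for a, e in zip(x0, noise)]
    pred = predict_velocity(x_t, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(x0)


def advantage_weighted_matching_loss(predict_velocity, samples, rewards, rng):
    """Average of advantage * flow-matching error over a group of samples.

    Minimizing this lowers the matching error on samples with positive
    advantage and raises it on samples with negative advantage.
    """
    advantages = group_normalized_advantages(rewards)
    total = 0.0
    for x0, adv in zip(samples, advantages):
        t = rng.random()                           # random timestep in [0, 1)
        noise = [rng.gauss(0.0, 1.0) for _ in x0]  # fresh Gaussian noise
        total += adv * flow_matching_error(predict_velocity, x0, noise, t)
    return total / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    zero_model = lambda x_t, t: [0.0 for _ in x_t]  # toy stand-in for a network
    samples = [[0.5, -0.5], [1.0, 2.0], [-1.0, 0.0], [0.0, 0.0]]
    rewards = [1.0, 0.2, 0.7, 0.1]
    print(advantage_weighted_matching_loss(zero_model, samples, rewards, rng))
```

In this sketch the per-sample term is the same regression loss used in pretraining; the reward enters only through the scalar weight, which is the property the abstract highlights.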
