V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

Abstract

Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. Policy-gradient online reinforcement learning (RL) offers a principled post-training framework, but its direct application is hindered by the intractable likelihoods of these models. Prior work therefore takes one of two routes: it either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or it uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing the variance of the surrogate and controlling the gradient steps, we show that this approach can outperform MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis while delivering a 2× speedup over MixGRPO and a 3× speedup over DiffusionNFT.
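
To make the combination described in the abstract more concrete, the following is a minimal sketch of the general idea only, not the paper's implementation. It assumes that GRPO-style advantages are obtained by normalizing rewards within a group of samples drawn for the same prompt, and that a per-sample ELBO estimate stands in for the intractable log-likelihood in a plain advantage-weighted objective. The function names, the use of the population standard deviation, and the `eps` constant are hypothetical choices for illustration; the variance-reduction and gradient-step-control techniques the abstract mentions are not specified there and are not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Assumption: population standard deviation plus a small eps for
    numerical safety; the paper's exact normalization is not given here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def surrogate_objective(advantages, elbo_estimates):
    """Advantage-weighted mean of per-sample ELBO estimates (to be maximized).

    Assumption: each ELBO estimate replaces the intractable log-likelihood
    of the corresponding sample. This is illustrative, not V-GRPO itself.
    """
    if len(advantages) != len(elbo_estimates):
        raise ValueError("advantages and elbo_estimates must have equal length")
    weighted = [a * e for a, e in zip(advantages, elbo_estimates)]
    return sum(weighted) / len(weighted)


if __name__ == "__main__":
    # Hypothetical rewards and ELBO estimates for a group of four samples.
    rewards = [0.2, 0.9, 0.4, 0.7]
    elbos = [-1.3, -0.8, -1.1, -0.9]
    adv = group_relative_advantages(rewards)
    print(adv)
    print(surrogate_objective(adv, elbos))
```

In a real training loop the ELBO estimates would be differentiable quantities computed by the denoising model, so that maximizing this objective raises the surrogate likelihood of samples with above-average reward and lowers it for the rest.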
