PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models

Abstract

Reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives. In the language domain, it has achieved remarkable success by using reinforcement learning (RL) to maximize rewards that reflect human preference. In the vision domain, however, existing RL-based reward finetuning methods are limited by their instability in large-scale training, which leaves them unable to generalize to complex, unseen prompts. In this paper, we propose Proximal Reward Difference Prediction (PRDP), which for the first time enables stable black-box reward finetuning of diffusion models on large-scale prompt datasets with over 100K prompts. Our key innovation is the Reward Difference Prediction (RDP) objective, which has the same optimal solution as the RL objective while enjoying better training stability. Specifically, the RDP objective is a supervised regression objective that tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. We theoretically prove that the diffusion model that obtains perfect reward difference prediction is exactly the maximizer of the RL objective. We further develop an online algorithm with proximal updates to stably optimize the RDP objective. In experiments, we demonstrate that PRDP can match the reward maximization ability of well-established RL-based methods in small-scale training. Furthermore, through large-scale training on text prompts from the Human Preference Dataset v2 and the Pick-a-Pic v1 dataset, PRDP achieves superior generation quality on a diverse set of complex, unseen prompts, whereas RL-based methods completely fail.
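
To make the RDP objective concrete, the sketch below computes a squared-error loss between a predicted and an observed reward difference for one image pair. The abstract does not give the formula, so the parameterization here is an assumption: the predicted reward of a denoising trajectory is taken to be beta times the summed per-step log-probability ratio between the finetuned model and a frozen reference model. This is the form suggested by a KL-regularized RL objective, whose maximizer is proportional to pi_ref(x) * exp(r(x) / beta). The names predicted_reward, rdp_loss, and beta are hypothetical, and the proximal update mentioned in the abstract is not shown.

```python
from typing import Sequence, Tuple

# A trajectory is represented by two equal-length lists of per-step
# log-probabilities: (under the finetuned model, under the reference model).
Trajectory = Tuple[Sequence[float], Sequence[float]]


def predicted_reward(logp_model: Sequence[float],
                     logp_ref: Sequence[float],
                     beta: float) -> float:
    """Assumed parameterization: beta * sum_t [log p_model(t) - log p_ref(t)]."""
    return beta * sum(m - r for m, r in zip(logp_model, logp_ref))


def rdp_loss(traj_a: Trajectory,
             traj_b: Trajectory,
             reward_a: float,
             reward_b: float,
             beta: float = 1.0) -> float:
    """Squared error between predicted and observed reward difference
    for a pair of images generated from the same prompt."""
    pred_diff = (predicted_reward(traj_a[0], traj_a[1], beta)
                 - predicted_reward(traj_b[0], traj_b[1], beta))
    true_diff = reward_a - reward_b  # rewards come from a black-box scorer
    return (pred_diff - true_diff) ** 2
```

Under these assumptions, a model identical to the reference predicts a zero difference, so the loss reduces to the squared observed reward difference. The loss vanishes when the predicted difference matches the observed one, which is the regression target the abstract describes.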
