PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models

Abstract

Reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives. Remarkable success has been achieved in the language domain by using reinforcement learning (RL) to maximize rewards that reflect human preference. However, in the vision domain, existing RL-based reward finetuning methods are limited by their instability in large-scale training, rendering them incapable of generalizing to complex, unseen prompts. In this paper, we propose Proximal Reward Difference Prediction (PRDP), which for the first time enables stable black-box reward finetuning for diffusion models on large-scale prompt datasets with over 100K prompts. Our key innovation is the Reward Difference Prediction (RDP) objective, which has the same optimal solution as the RL objective while enjoying better training stability. Specifically, the RDP objective is a supervised regression objective that tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. We theoretically prove that the diffusion model that obtains perfect reward difference prediction is exactly the maximizer of the RL objective. We further develop an online algorithm with proximal updates to stably optimize the RDP objective. In experiments, we demonstrate that PRDP can match the reward maximization ability of well-established RL-based methods in small-scale training. Furthermore, through large-scale training on text prompts from the Human Preference Dataset v2 and the Pick-a-Pic v1 dataset, PRDP achieves superior generation quality on a diverse set of complex, unseen prompts, whereas RL-based methods completely fail.
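To make the idea of "predicting the reward difference from denoising trajectories" concrete, the following is a minimal sketch and not the paper's implementation. It assumes the standard closed form for KL-regularized RL, in which the optimal policy satisfies beta * log(pi / pi_ref) = reward minus a constant, so the constant cancels when two samples for the same prompt are compared. The function names, the coefficient `beta`, and the use of a plain squared error are illustrative assumptions; the abstract does not specify them, and the proximal updates it mentions are omitted here.

```python
from typing import Sequence


def trajectory_log_ratio(logp_model: Sequence[float],
                         logp_ref: Sequence[float]) -> float:
    """Sum over denoising steps of log pi_model(step) - log pi_ref(step).

    Both inputs are hypothetical per-step log-probabilities of the same
    denoising trajectory under the finetuned and the reference model.
    """
    if len(logp_model) != len(logp_ref):
        raise ValueError("trajectories must have the same number of steps")
    return sum(m - r for m, r in zip(logp_model, logp_ref))


def rdp_loss(logp_model_a: Sequence[float], logp_ref_a: Sequence[float],
             logp_model_b: Sequence[float], logp_ref_b: Sequence[float],
             reward_a: float, reward_b: float, beta: float = 1.0) -> float:
    """Squared error between predicted and observed reward difference.

    Assumption: the predicted difference is beta times the difference of
    the two trajectory log-ratios (illustrative parameterization).
    """
    predicted = beta * (trajectory_log_ratio(logp_model_a, logp_ref_a)
                        - trajectory_log_ratio(logp_model_b, logp_ref_b))
    target = reward_a - reward_b
    return (predicted - target) ** 2


# Toy two-step trajectories for one image pair (made-up numbers).
loss = rdp_loss([-1.0, -2.0], [-1.5, -2.5],   # image a: log-ratio = 1.0
                [-2.0, -2.0], [-1.0, -2.0],   # image b: log-ratio = -1.0
                reward_a=3.0, reward_b=2.0, beta=0.5)
print(loss)
```

In this toy case the predicted difference, 0.5 * (1.0 - (-1.0)) = 1.0, equals the observed reward difference, so the loss is zero. Under the stated assumption, a model that drives this loss to zero for all pairs matches the KL-regularized RL optimum, which is the kind of equivalence the abstract claims.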
