人間のフィードバックを用いて報酬モデルなしで拡散モデルをファインチューニングする

要旨

人間のフィードバックを用いた強化学習（RLHF）は、拡散モデルのファインチューニングにおいて大きな可能性を示しています。従来の手法では、まず人間の選好に沿った報酬モデルを訓練し、その後RL技術を活用して基盤となるモデルを微調整します。しかし、効率的な報酬モデルの構築には大規模なデータセット、最適なアーキテクチャ、手動のハイパーパラメータ調整が必要であり、このプロセスは時間とコストの両面で負担が大きいという課題があります。大規模言語モデルのファインチューニングに有効な直接選好最適化（DPO）手法は、報酬モデルの必要性を排除しますが、拡散モデルのノイズ除去プロセスに伴う膨大なGPUメモリ要件がDPO手法の直接的な適用を妨げています。この問題を解決するため、我々は拡散モデルを直接ファインチューニングするための直接選好型ノイズ除去拡散ポリシー最適化（D3PO）手法を提案します。理論的な分析により、D3POは報酬モデルの訓練を省略するものの、人間のフィードバックデータを用いて訓練された最適な報酬モデルとして機能し、学習プロセスを効果的に導くことが示されています。このアプローチは報酬モデルの訓練を必要とせず、より直接的でコスト効率が高く、計算オーバーヘッドを最小化します。実験では、我々の手法は人間の選好の代理として目的関数の相対的な尺度を使用し、真の報酬を用いる手法と同等の結果を達成しました。さらに、D3POは画像の歪み率を低減し、より安全な画像を生成する能力を示し、堅牢な報酬モデルが不足している課題を克服しています。

English

Using reinforcement learning with human feedback (RLHF) has shown significant promise in fine-tuning diffusion models. Previous methods start by training a reward model that aligns with human preferences, then leverage RL techniques to fine-tune the underlying models. However, crafting an efficient reward model demands extensive datasets, optimal architecture, and manual hyperparameter tuning, making the process both time and cost-intensive. The direct preference optimization (DPO) method, effective in fine-tuning large language models, eliminates the necessity for a reward model. However, the extensive GPU memory requirement of the diffusion model's denoising process hinders the direct application of the DPO method. To address this issue, we introduce the Direct Preference for Denoising Diffusion Policy Optimization (D3PO) method to directly fine-tune diffusion models. The theoretical analysis demonstrates that although D3PO omits training a reward model, it effectively functions as the optimal reward model trained using human feedback data to guide the learning process. This approach requires no training of a reward model, proving to be more direct, cost-effective, and minimizing computational overhead. In experiments, our method uses the relative scale of objectives as a proxy for human preference, delivering comparable results to methods using ground-truth rewards. Moreover, D3PO demonstrates the ability to reduce image distortion rates and generate safer images, overcoming challenges lacking robust reward models.

人間のフィードバックを用いて報酬モデルなしで拡散モデルをファインチューニングする

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

要旨

Support