DRAGON: Distributional Rewards Optimize Diffusion Generative Models

Abstract

We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. DRAGON is more flexible than traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO). It can optimize reward functions that evaluate either individual examples or distributions of them, which makes it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modality encoders such as CLAP are used, the reference examples may be of a different modality (e.g., text versus audio). DRAGON then gathers online and on-policy generations, scores them to construct a positive demonstration set and a negative set, and leverages the contrast between the two sets to maximize the reward. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 different reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Fréchet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets do enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. DRAGON thus demonstrates a new approach to designing and optimizing reward functions for improving human-perceived quality. Sound examples are available at https://ml-dragon.github.io/web.
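To make the distribution-to-distribution reward idea concrete, the following is a minimal sketch, not DRAGON's implementation. It computes the standard Fréchet distance between Gaussians fitted to two sets of embeddings, then splits a batch of scored generations into positive and negative index sets. The function names, the median threshold used for the split, and the random arrays standing in for encoder embeddings are all assumptions made for illustration; the abstract does not specify how the two sets are formed.

```python
import numpy as np


def gaussian_stats(embeddings):
    """Return the mean and covariance of an (n, d) array of embeddings."""
    mu = embeddings.mean(axis=0)
    sigma = np.cov(embeddings, rowvar=False)
    return mu, sigma


def frechet_distance(gen_emb, ref_emb):
    """Frechet distance between Gaussians fitted to two embedding sets."""
    mu_g, sigma_g = gaussian_stats(gen_emb)
    mu_r, sigma_r = gaussian_stats(ref_emb)
    # Tr((sigma_g sigma_r)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of sigma_g @ sigma_r. For positive semi-definite inputs
    # these eigenvalues are real and non-negative up to numerical noise,
    # so the imaginary part is dropped and negatives are clipped to zero.
    eigvals = np.linalg.eigvals(sigma_g @ sigma_r)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_g - mu_r
    return float(
        diff @ diff + np.trace(sigma_g) + np.trace(sigma_r) - 2.0 * trace_sqrt
    )


def split_by_reward(rewards):
    """Hypothetical split rule: at or above the median is positive."""
    rewards = np.asarray(rewards)
    threshold = np.median(rewards)
    positive = np.flatnonzero(rewards >= threshold)
    negative = np.flatnonzero(rewards < threshold)
    return positive, negative


# Random vectors stand in for encoder embeddings (assumption for the demo).
rng = np.random.default_rng(0)
ref = rng.normal(size=(500, 4))
gen = ref + np.array([1.0, 0.0, 0.0, 0.0])  # same spread, shifted mean

print(frechet_distance(ref, ref))
print(frechet_distance(gen, ref))

# Scalar rewards for four hypothetical generations.
pos, neg = split_by_reward([0.1, 0.9, 0.4, 0.7])
print(pos, neg)
```

In this toy setup the distance from a set to itself is numerically close to zero, and shifting every embedding by a unit vector leaves the covariance unchanged, so the distance is close to the squared length of the shift. Because a lower Fréchet distance is better, a reward built on it would use the negated distance.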
