Diversity-Rewarded CFG Distillation

Abstract

Generative models are transforming creative domains such as music generation, with inference-time strategies like Classifier-Free Guidance (CFG) playing a crucial role. However, CFG doubles inference cost while limiting originality and diversity across generated content. In this paper, we introduce diversity-rewarded CFG distillation, a novel finetuning procedure that distills the strengths of CFG while addressing its limitations. Our approach optimises two training objectives: (1) a distillation objective, encouraging the model alone (without CFG) to imitate the CFG-augmented predictions, and (2) an RL objective with a diversity reward, promoting the generation of diverse outputs for a given prompt. By finetuning, we learn model weights with the ability to generate high-quality and diverse outputs, without any inference overhead. This also unlocks the potential of weight-based model merging strategies: by interpolating between the weights of two models (the first focusing on quality, the second on diversity), we can control the quality-diversity trade-off at deployment time, and even further boost performance. We conduct extensive experiments on the MusicLM (Agostinelli et al., 2023) text-to-music generative model, where our approach surpasses CFG in terms of quality-diversity Pareto optimality. According to human evaluators, our finetuned-then-merged model generates samples with higher quality-diversity than the base model augmented with CFG. Explore our generations at https://google-research.github.io/seanet/musiclm/diverse_music/.
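
To make the three ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the usual logit-space form of CFG, a KL divergence from the CFG-augmented teacher to the student as the distillation loss (the direction of the KL is an assumption), and plain linear interpolation for weight merging. All function and parameter names (`cfg_logits`, `distillation_loss`, `interpolate_weights`, `gamma`, `lam`) are hypothetical, and the RL diversity reward is not shown.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cfg_logits(cond_logits, uncond_logits, gamma):
    """CFG in logit space: move from the unconditional prediction toward
    the conditional one by a guidance factor gamma (gamma=1 means no guidance).
    Computing this at inference needs two forward passes, hence the doubled cost."""
    return uncond_logits + gamma * (cond_logits - uncond_logits)


def distillation_loss(student_logits, cond_logits, uncond_logits, gamma):
    """KL(teacher || student), averaged over positions. The teacher is the
    CFG-augmented distribution; the student is the model without CFG.
    The KL direction is an assumption made for this sketch."""
    teacher_logp = log_softmax(cfg_logits(cond_logits, uncond_logits, gamma))
    student_logp = log_softmax(student_logits)
    kl = (np.exp(teacher_logp) * (teacher_logp - student_logp)).sum(axis=-1)
    return float(kl.mean())


def interpolate_weights(quality_weights, diversity_weights, lam):
    """Linear interpolation between two sets of model weights, given as dicts
    of arrays with identical keys and shapes. lam=0 returns the
    quality-focused weights, lam=1 the diversity-focused weights."""
    return {
        name: (1.0 - lam) * quality_weights[name] + lam * diversity_weights[name]
        for name in quality_weights
    }
```

In this sketch, sweeping `lam` between 0 and 1 at deployment time is what trades quality against diversity, and the merged model needs only one forward pass per step, unlike CFG.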
