MusicRL: Aligning Music Generation to Human Preferences

Abstract

We propose MusicRL, the first music generation system finetuned from human feedback. Appreciation of text-to-music models is particularly subjective, since the concept of musicality as well as the specific intention behind a caption are user-dependent (e.g. a caption such as "upbeat work-out music" can map to a retro guitar solo or a techno pop beat). Not only does this make supervised training of such models challenging, but it also calls for integrating continuous human feedback in their post-deployment finetuning. MusicRL is a pretrained autoregressive MusicLM (Agostinelli et al., 2023) model of discrete audio tokens finetuned with reinforcement learning to maximise sequence-level rewards. We design reward functions related specifically to text adherence and audio quality with the help of selected raters, and use those to finetune MusicLM into MusicRL-R. We deploy MusicLM to users and collect a substantial dataset comprising 300,000 pairwise preferences. Using Reinforcement Learning from Human Feedback (RLHF), we train MusicRL-U, the first text-to-music model that incorporates human feedback at scale. Human evaluations show that both MusicRL-R and MusicRL-U are preferred to the baseline. Ultimately, MusicRL-RU combines the two approaches and results in the best model according to human raters. Ablation studies shed light on the musical attributes influencing human preferences, indicating that text adherence and quality account for only part of them. This underscores the prevalence of subjectivity in musical appreciation and calls for further involvement of human listeners in the finetuning of music generation models.
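The abstract does not say how the pairwise preferences are turned into a training signal. A common choice in RLHF is to fit a reward model with a Bradley-Terry objective, where the probability that the preferred clip wins is the sigmoid of the difference between the two scalar rewards. The sketch below illustrates that loss only, under this assumption; the function names and the example scores are hypothetical and are not taken from the paper.

```python
import math


def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one pairwise preference.

    Assumption: a reward model has already scored both generated clips
    for the same caption; this is a generic RLHF loss, not necessarily
    the objective used for MusicRL-U.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) == softplus(-margin), written in a form that
    # does not overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_preference_loss(scored_pairs) -> float:
    """Average loss over (reward_preferred, reward_rejected) tuples."""
    losses = [preference_loss(win, lose) for win, lose in scored_pairs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical reward-model scores for three preference pairs.
    pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
    print(mean_preference_loss(pairs))
```

The loss is small when the reward model ranks the preferred clip well above the rejected one and grows roughly linearly when the ranking is reversed, so minimising it pushes the reward model towards the listeners' choices. A reward model fitted this way could then serve as a sequence-level reward for reinforcement learning.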
