MusicRL: Aligning Music Generation to Human Preferences
February 6, 2024
Authors: Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej Kastelic, Zalán Borsos, Brian McWilliams, Victor Ungureanu, Olivier Bachem, Olivier Pietquin, Matthieu Geist, Léonard Hussenot, Neil Zeghidour, Andrea Agostinelli
cs.AI
Abstract
We propose MusicRL, the first music generation system finetuned from human
feedback. Appreciation of text-to-music models is particularly subjective since
the concept of musicality and the specific intention behind a caption
are user-dependent (e.g. a caption such as "upbeat work-out music" can map to a
retro guitar solo or a techno pop beat). Not only does this make supervised
training of such models challenging, but it also calls for integrating
continuous human feedback in their post-deployment finetuning. MusicRL is a
pretrained autoregressive MusicLM (Agostinelli et al., 2023) model of discrete
audio tokens finetuned with reinforcement learning to maximise sequence-level
rewards. We design reward functions related specifically to text adherence and
audio quality with the help of selected raters, and use these to finetune
MusicLM into MusicRL-R. We deploy MusicLM to users and collect a substantial
dataset comprising 300,000 pairwise preferences. Using Reinforcement Learning
from Human Feedback (RLHF), we train MusicRL-U, the first text-to-music model
that incorporates human feedback at scale. Human evaluations show that both
MusicRL-R and MusicRL-U are preferred to the baseline. Ultimately, MusicRL-RU
combines the two approaches and results in the best model according to human
raters. Ablation studies shed light on the musical attributes influencing human
preferences, indicating that text adherence and audio quality account for only
part of them. This underscores the prevalence of subjectivity in musical appreciation
and calls for further involvement of human listeners in the finetuning of music
generation models.
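
The RLHF stage described above first distills the 300,000 pairwise preferences into a learned reward model. The abstract does not state the training objective; a standard choice for pairwise preference data is the Bradley-Terry loss, which pushes the preferred clip's score above the rejected one's. Below is a minimal PyTorch sketch under that assumption; `RewardModel` and its mean-pooling over audio-token embeddings are hypothetical stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Hypothetical scorer: embeds discrete audio tokens, mean-pools,
    and maps the result to one scalar reward per sequence."""

    def __init__(self, vocab_size: int = 1024, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer audio tokens
        pooled = self.embed(tokens).mean(dim=1)   # (batch, dim)
        return self.head(pooled).squeeze(-1)      # (batch,) scalar rewards

def bradley_terry_loss(rm: RewardModel,
                       preferred: torch.Tensor,
                       rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_pref - r_rej): maximizes the probability that
    the human-preferred clip scores higher than the rejected one."""
    return -F.logsigmoid(rm(preferred) - rm(rejected)).mean()

# Toy usage on random token pairs.
rm = RewardModel()
preferred = torch.randint(0, 1024, (8, 600))  # 8 preference pairs
rejected = torch.randint(0, 1024, (8, 600))
loss = bradley_terry_loss(rm, preferred, rejected)
loss.backward()
```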
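
Given a sequence-level reward, whether from the handcrafted text-adherence and quality functions (MusicRL-R) or from the learned reward model (MusicRL-U), the pretrained policy is then finetuned with RL. The abstract does not specify the algorithm; a common recipe, sketched below purely as an assumption, is a REINFORCE-style update whose reward is shaped by a KL penalty keeping the finetuned policy close to the frozen pretrained checkpoint (here, MusicLM). All names are illustrative.

```python
import torch

def rl_finetune_loss(policy_logp: torch.Tensor,
                     ref_logp: torch.Tensor,
                     reward: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """One REINFORCE-style loss on sampled token sequences.

    policy_logp: (batch,) summed log-probs of each sampled sequence
        under the policy being finetuned (requires grad).
    ref_logp: (batch,) log-probs of the same sequences under the
        frozen pretrained model, used only as a KL anchor.
    reward: (batch,) scalar sequence-level rewards.
    """
    # Shape the reward with a per-sequence KL estimate so the policy
    # is discouraged from drifting far from the pretrained model.
    kl_estimate = (policy_logp - ref_logp).detach()
    shaped = reward - beta * kl_estimate
    # REINFORCE: push up log-probs of sequences with high shaped reward.
    return -(shaped * policy_logp).mean()

# Toy usage with random log-probs and rewards.
policy_logp = torch.randn(4, requires_grad=True)
ref_logp = torch.randn(4)
reward = torch.randn(4)
rl_finetune_loss(policy_logp, ref_logp, reward).backward()
```

In practice such a recipe would also normalize rewards and subtract a baseline to reduce gradient variance; the sketch omits these details for brevity.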