MusicRL: Aligning Music Generation to Human Preferences
February 6, 2024
Authors: Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej Kastelic, Zalán Borsos, Brian McWilliams, Victor Ungureanu, Olivier Bachem, Olivier Pietquin, Matthieu Geist, Léonard Hussenot, Neil Zeghidour, Andrea Agostinelli
cs.AI
Abstract
We propose MusicRL, the first music generation system finetuned from human
feedback. Appreciation of text-to-music models is particularly subjective since
the concept of musicality as well as the specific intention behind a caption
are user-dependent (e.g. a caption such as "upbeat work-out music" can map to a
retro guitar solo or a techno pop beat). Not only does this make supervised
training of such models challenging, but it also calls for integrating
continuous human feedback into their post-deployment finetuning. MusicRL is a
pretrained autoregressive MusicLM (Agostinelli et al., 2023) model of discrete
audio tokens finetuned with reinforcement learning to maximise sequence-level
rewards. We design reward functions related specifically to text adherence and
audio quality with the help of selected raters, and use them to finetune
MusicLM into MusicRL-R. We deploy MusicLM to users and collect a substantial
dataset comprising 300,000 pairwise preferences. Using Reinforcement Learning
from Human Feedback (RLHF), we train MusicRL-U, the first text-to-music model
that incorporates human feedback at scale. Human evaluations show that both
MusicRL-R and MusicRL-U are preferred to the baseline. Ultimately, MusicRL-RU
combines the two approaches and results in the best model according to human
raters. Ablation studies shed light on the musical attributes influencing human
preferences, indicating that text adherence and quality account for only part of
them. This underscores the prevalence of subjectivity in musical appreciation
and calls for further involvement of human listeners in the finetuning of music
generation models.
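
For readers unfamiliar with the recipe, the following is a minimal sketch of what "finetuning an autoregressive model of discrete audio tokens with reinforcement learning to maximise sequence-level rewards" can look like in practice. It is a toy KL-regularized REINFORCE loop, not the paper's code: the model, reward, and hyperparameters below are illustrative placeholders standing in for MusicLM and the text-adherence/quality rewards.

```python
# Illustrative sketch only (not MusicRL's actual implementation): REINFORCE
# with a KL penalty toward a frozen copy of the pretrained model, maximising
# a sequence-level reward over sampled discrete-token sequences.
import torch
import torch.nn as nn

VOCAB, DIM, SEQ_LEN, BATCH = 16, 32, 12, 8

class ToyTokenModel(nn.Module):
    """Tiny autoregressive stand-in for a discrete audio-token model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, DIM)  # index VOCAB acts as BOS
        self.cell = nn.GRUCell(DIM, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def rollout(self, batch):
        """Sample token sequences and their per-token log-probabilities."""
        h = torch.zeros(batch, DIM)
        tok = torch.full((batch,), VOCAB, dtype=torch.long)
        tokens, logps = [], []
        for _ in range(SEQ_LEN):
            h = self.cell(self.embed(tok), h)
            dist = torch.distributions.Categorical(logits=self.head(h))
            tok = dist.sample()
            tokens.append(tok)
            logps.append(dist.log_prob(tok))
        return torch.stack(tokens, 1), torch.stack(logps, 1)

    def logprobs(self, tokens):
        """Log-probabilities this model assigns to given token sequences."""
        h = torch.zeros(tokens.shape[0], DIM)
        tok = torch.full((tokens.shape[0],), VOCAB, dtype=torch.long)
        out = []
        for t in range(SEQ_LEN):
            h = self.cell(self.embed(tok), h)
            dist = torch.distributions.Categorical(logits=self.head(h))
            out.append(dist.log_prob(tokens[:, t]))
            tok = tokens[:, t]
        return torch.stack(out, 1)

def toy_reward(tokens):
    """Placeholder sequence-level reward (e.g. text adherence in the paper)."""
    return (tokens == 3).float().mean(dim=1)

policy, ref = ToyTokenModel(), ToyTokenModel()
ref.load_state_dict(policy.state_dict())  # frozen "pretrained" reference
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
kl_coef = 0.05

for step in range(200):
    tokens, logps = policy.rollout(BATCH)
    with torch.no_grad():
        ref_logps = ref.logprobs(tokens)
    # Monte-Carlo estimate of the sequence-level KL to the reference model,
    # folded into the return so the policy stays close to plausible outputs.
    kl = (logps - ref_logps).sum(dim=1)
    returns = toy_reward(tokens) - kl_coef * kl
    # REINFORCE with a mean baseline; gradients flow only through the
    # log-probabilities of the sampled tokens.
    advantage = (returns - returns.mean()).detach()
    loss = -(advantage * logps.sum(dim=1)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```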
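Similarly, pairwise preferences like the 300,000 collected here are conventionally turned into a learnable reward via a Bradley-Terry objective before RLHF finetuning. A toy sketch under that assumption follows; the architecture, data, and preference rule are invented for illustration and are not the paper's.

```python
# Illustrative sketch of fitting a reward model on pairwise preferences with a
# Bradley-Terry loss, the standard RLHF recipe; everything here is a toy
# placeholder, not the paper's architecture or dataset.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, SEQ_LEN, BATCH = 16, 32, 12, 64

class ToyRewardModel(nn.Module):
    """Maps a token sequence to a scalar score (caption input omitted here)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.score = nn.Linear(DIM, 1)

    def forward(self, tokens):  # tokens: [B, T]
        return self.score(self.embed(tokens).mean(dim=1)).squeeze(-1)  # [B]

model = ToyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    a = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))
    b = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))
    # Toy "human" preference rule: the clip with more occurrences of token 3
    # wins. In the paper, this label comes from real user comparisons.
    wins_a = (a == 3).sum(dim=1) >= (b == 3).sum(dim=1)
    preferred = torch.where(wins_a.unsqueeze(1), a, b)
    rejected = torch.where(wins_a.unsqueeze(1), b, a)

    # Bradley-Terry: maximize the log-probability that the preferred clip
    # scores higher, i.e. minimize -log sigmoid(r_preferred - r_rejected).
    loss = -F.logsigmoid(model(preferred) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once trained, such a learned reward can be plugged into a sequence-level RL loop like the one sketched above in place of the hand-designed rewards, in the spirit of the MusicRL-U recipe the abstract describes; MusicRL-RU combines both reward sources.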