MusicRL: Aligning Music Generation to Human Preferences
February 6, 2024
Authors: Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej Kastelic, Zalán Borsos, Brian McWilliams, Victor Ungureanu, Olivier Bachem, Olivier Pietquin, Matthieu Geist, Léonard Hussenot, Neil Zeghidour, Andrea Agostinelli
cs.AI
Abstract
We propose MusicRL, the first music generation system finetuned from human
feedback. Appreciation of text-to-music models is particularly subjective since
the concept of musicality as well as the specific intention behind a caption
are user-dependent (e.g. a caption such as "upbeat work-out music" can map to a
retro guitar solo or a techno pop beat). Not only does this make supervised
training of such models challenging, but it also calls for integrating
continuous human feedback into their post-deployment finetuning. MusicRL is a
pretrained autoregressive MusicLM (Agostinelli et al., 2023) model of discrete
audio tokens finetuned with reinforcement learning to maximise sequence-level
rewards. We design reward functions related specifically to text adherence and
audio quality with the help of selected raters, and use them to finetune
MusicLM into MusicRL-R. We deploy MusicLM to users and collect a substantial
dataset comprising 300,000 pairwise preferences. Using Reinforcement Learning
from Human Feedback (RLHF), we train MusicRL-U, the first text-to-music model
that incorporates human feedback at scale. Human evaluations show that both
MusicRL-R and MusicRL-U are preferred to the baseline. Ultimately, MusicRL-RU
combines the two approaches and results in the best model according to human
raters. Ablation studies shed light on the musical attributes influencing human
preferences, indicating that text adherence and quality account for only part of
them. This underscores the prevalence of subjectivity in musical appreciation
and calls for further involvement of human listeners in the finetuning of music
generation models.
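
For readers unfamiliar with the recipe, the following is a minimal sketch of what "finetuning an autoregressive model of discrete audio tokens with reinforcement learning to maximise sequence-level rewards" can look like in practice. It is a toy KL-regularized REINFORCE loop, not the paper's code: the model, reward, and hyperparameters below are illustrative placeholders standing in for MusicLM and the text-adherence/quality rewards.

```python
# Illustrative sketch only (not MusicRL's actual implementation): REINFORCE
# with a KL penalty toward a frozen copy of the pretrained model, maximising
# a sequence-level reward over sampled discrete-token sequences.
import torch
import torch.nn as nn

VOCAB, DIM, SEQ_LEN, BATCH = 16, 32, 12, 8

class ToyTokenModel(nn.Module):
    """Tiny autoregressive stand-in for a discrete audio-token model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, DIM)  # index VOCAB acts as BOS
        self.cell = nn.GRUCell(DIM, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def rollout(self, batch):
        """Sample token sequences and their per-token log-probabilities."""
        h = torch.zeros(batch, DIM)
        tok = torch.full((batch,), VOCAB, dtype=torch.long)
        tokens, logps = [], []
        for _ in range(SEQ_LEN):
            h = self.cell(self.embed(tok), h)
            dist = torch.distributions.Categorical(logits=self.head(h))
            tok = dist.sample()
            tokens.append(tok)
            logps.append(dist.log_prob(tok))
        return torch.stack(tokens, 1), torch.stack(logps, 1)

    def logprobs(self, tokens):
        """Log-probabilities this model assigns to given token sequences."""
        h = torch.zeros(tokens.shape[0], DIM)
        tok = torch.full((tokens.shape[0],), VOCAB, dtype=torch.long)
        out = []
        for t in range(SEQ_LEN):
            h = self.cell(self.embed(tok), h)
            dist = torch.distributions.Categorical(logits=self.head(h))
            out.append(dist.log_prob(tokens[:, t]))
            tok = tokens[:, t]
        return torch.stack(out, 1)

def toy_reward(tokens):
    """Placeholder sequence-level reward (e.g. text adherence in the paper)."""
    return (tokens == 3).float().mean(dim=1)

policy, ref = ToyTokenModel(), ToyTokenModel()
ref.load_state_dict(policy.state_dict())  # frozen "pretrained" reference
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
kl_coef = 0.05

for step in range(200):
    tokens, logps = policy.rollout(BATCH)
    with torch.no_grad():
        ref_logps = ref.logprobs(tokens)
    # Monte-Carlo estimate of the sequence-level KL to the reference model,
    # folded into the return so the policy stays close to plausible outputs.
    kl = (logps - ref_logps).sum(dim=1)
    returns = toy_reward(tokens) - kl_coef * kl
    # REINFORCE with a mean baseline; gradients flow only through the
    # log-probabilities of the sampled tokens.
    advantage = (returns - returns.mean()).detach()
    loss = -(advantage * logps.sum(dim=1)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```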
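Similarly, pairwise preferences like the 300,000 collected here are conventionally turned into a learnable reward via a Bradley-Terry objective before RLHF finetuning. A toy sketch under that assumption follows; the architecture, data, and preference rule are invented for illustration and are not the paper's.

```python
# Illustrative sketch of fitting a reward model on pairwise preferences with a
# Bradley-Terry loss, the standard RLHF recipe; everything here is a toy
# placeholder, not the paper's architecture or dataset.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, SEQ_LEN, BATCH = 16, 32, 12, 64

class ToyRewardModel(nn.Module):
    """Maps a token sequence to a scalar score (caption input omitted here)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.score = nn.Linear(DIM, 1)

    def forward(self, tokens):  # tokens: [B, T]
        return self.score(self.embed(tokens).mean(dim=1)).squeeze(-1)  # [B]

model = ToyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    a = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))
    b = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))
    # Toy "human" preference rule: the clip with more occurrences of token 3
    # wins. In the paper, this label comes from real user comparisons.
    wins_a = (a == 3).sum(dim=1) >= (b == 3).sum(dim=1)
    preferred = torch.where(wins_a.unsqueeze(1), a, b)
    rejected = torch.where(wins_a.unsqueeze(1), b, a)

    # Bradley-Terry: maximize the log-probability that the preferred clip
    # scores higher, i.e. minimize -log sigmoid(r_preferred - r_rejected).
    loss = -F.logsigmoid(model(preferred) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once trained, such a learned reward can be plugged into a sequence-level RL loop like the one sketched above in place of the hand-designed rewards, in the spirit of the MusicRL-U recipe the abstract describes; MusicRL-RU combines both reward sources.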