MusicRL: Aligning Music Generation to Human Preferences

Abstract

We propose MusicRL, the first music generation system finetuned from human feedback. Appreciation of text-to-music models is particularly subjective, since the concept of musicality as well as the specific intention behind a caption are user-dependent (e.g. a caption such as "upbeat work-out music" can map to a retro guitar solo or a techno pop beat). Not only does this make supervised training of such models challenging, but it also calls for integrating continuous human feedback into their post-deployment finetuning. MusicRL is a pretrained autoregressive MusicLM (Agostinelli et al., 2023) model of discrete audio tokens, finetuned with reinforcement learning to maximise sequence-level rewards. We design reward functions related specifically to text adherence and audio quality with help from selected raters, and use those to finetune MusicLM into MusicRL-R. We deploy MusicLM to users and collect a substantial dataset comprising 300,000 pairwise preferences. Using Reinforcement Learning from Human Feedback (RLHF), we train MusicRL-U, the first text-to-music model that incorporates human feedback at scale. Human evaluations show that both MusicRL-R and MusicRL-U are preferred to the baseline. Ultimately, MusicRL-RU combines the two approaches and results in the best model according to human raters. Ablation studies shed light on the musical attributes influencing human preferences, indicating that text adherence and quality account for only part of them. This underscores the prevalence of subjectivity in musical appreciation and calls for further involvement of human listeners in the finetuning of music generation models.
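The abstract does not say how the pairwise preferences are turned into a reward signal. As an illustration only, a common choice in RLHF pipelines is a Bradley-Terry-style objective: a reward model scores both clips in a pair and is penalised when the clip the user preferred does not receive the higher score. The sketch below assumes scalar reward scores are already available. The function names are hypothetical, and this is not presented as the authors' implementation.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference under a
    Bradley-Terry model (an assumed objective, not taken from the source)."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs):
    """Average the loss over (reward_preferred, reward_rejected) pairs."""
    return sum(pairwise_preference_loss(w, l) for w, l in pairs) / len(pairs)
```

Under this assumed objective, the loss is small when the preferred clip scores well above the rejected one, equals log 2 when the two scores tie, and grows roughly linearly as the ranking becomes more wrong.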
