Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning
May 28, 2024
Authors: Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Martínez-Ramírez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon
cs.AI
Abstract
Recent advances in text-to-music editing, which employ text queries to modify
music (e.g., by changing its style or adjusting instrumental components),
present unique challenges and opportunities for AI-assisted music creation.
Previous approaches in this domain have been constrained by the necessity to
train specific editing models from scratch, which is both resource-intensive
and inefficient; other research uses large language models to predict edited
music, resulting in imprecise audio reconstruction. To combine the strengths
and address these limitations, we introduce Instruct-MusicGen, a novel approach
that finetunes a pretrained MusicGen model to efficiently follow editing
instructions such as adding, removing, or separating stems. Our approach
involves a modification of the original MusicGen architecture by incorporating
a text fusion module and an audio fusion module, which allow the model to
process instruction texts and audio inputs concurrently and yield the desired
edited music. Remarkably, Instruct-MusicGen adds only 8% new parameters
to the original MusicGen model and is trained for only 5K steps, yet it
outperforms existing baselines across all tasks and demonstrates
performance comparable to models trained for specific tasks.
This advancement not only enhances the efficiency of text-to-music editing but
also broadens the applicability of music language models in dynamic music
production environments.
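The abstract does not specify how the text and audio fusion modules are wired into MusicGen. As an illustrative sketch only (not the paper's implementation), the general idea of conditioning a decoder's hidden states on both instruction-text embeddings and input-audio embeddings can be expressed as residual cross-attention; all shapes, weights, and variable names below are invented for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, memory, Wq, Wk, Wv):
    """Single-head cross-attention: query tokens attend to memory tokens."""
    Q = query @ Wq                      # (Tq, d)
    K = memory @ Wk                     # (Tm, d)
    V = memory @ Wv                     # (Tm, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V  # (Tq, d)

rng = np.random.default_rng(0)
d = 16
music_tokens = rng.standard_normal((10, d))  # decoder hidden states
instr_emb    = rng.standard_normal((6, d))   # instruction-text embeddings
audio_emb    = rng.standard_normal((20, d))  # conditioning-audio embeddings

Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

# Fuse the text condition, then the audio condition, as residual updates,
# so the pretrained decoder's representations are adjusted rather than replaced.
fused = music_tokens + cross_attention(music_tokens, instr_emb, Wq, Wk, Wv)
fused = fused + cross_attention(fused, audio_emb, Wq, Wk, Wv)
print(fused.shape)  # prints (10, 16)
```

Because only the small fusion projections would be newly trained in such a scheme, the parameter overhead stays a small fraction of the frozen base model, consistent with the 8% figure reported in the abstract.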