
Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

May 28, 2024
作者: Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Martínez-Ramírez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon
cs.AI

Abstract

Recent advances in text-to-music editing, which employ text queries to modify music (e.g., by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been constrained by the necessity to train specific editing models from scratch, which is both resource-intensive and inefficient; other research uses large language models to predict edited music, resulting in imprecise audio reconstruction. To combine the strengths of both and address these limitations, we introduce Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions such as adding, removing, or separating stems. Our approach modifies the original MusicGen architecture by incorporating a text fusion module and an audio fusion module, which allow the model to process instruction texts and audio inputs concurrently and yield the desired edited music. Remarkably, Instruct-MusicGen introduces only 8% new parameters to the original MusicGen model and trains for only 5K steps, yet it achieves superior performance across all tasks compared to existing baselines, and demonstrates performance comparable to models trained for specific tasks. This advancement not only enhances the efficiency of text-to-music editing but also broadens the applicability of music language models in dynamic music production environments.
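The abstract's core idea is conditioning a frozen music language model on two extra streams: an instruction text embedding and the codec tokens of the input audio, each injected via a fusion module. The paper does not spell out the module internals here, so the following is a minimal, hypothetical NumPy sketch of one plausible reading: the decoder's hidden states attend separately to the text and audio conditioning via single-head cross-attention, and the results are added residually. All names, shapes, and weight initializations are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, w_q, w_k, w_v):
    """Single-head cross-attention: each query token attends to all
    key_value tokens (toy, unbatched, no output projection)."""
    q = query @ w_q
    k = key_value @ w_k
    v = key_value @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16  # toy hidden size; the real model is much wider

# Hypothetical conditioning inputs:
text_tokens = rng.normal(size=(4, d))    # embedded instruction, e.g. "add drums"
audio_tokens = rng.normal(size=(10, d))  # codec tokens of the input audio
dec_hidden = rng.normal(size=(10, d))    # frozen decoder hidden states

# Small random projection weights for the two fusion modules.
w = [rng.normal(size=(d, d)) * 0.1 for _ in range(6)]

# Audio fusion: decoder attends to the conditioning audio.
# Text fusion: decoder attends to the instruction text.
# Residual sum keeps the frozen decoder's representation intact.
fused = (dec_hidden
         + cross_attention(dec_hidden, audio_tokens, *w[:3])
         + cross_attention(dec_hidden, text_tokens, *w[3:]))
print(fused.shape)
```

In an instruction-tuning setup like the one described, only the small fusion-module weights (here, `w`) would be trained, which is consistent with the paper's claim of adding just 8% new parameters to the base model.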