MuseCoco: Generating Symbolic Music from Text

Abstract

Generating music from text descriptions is user-friendly, since text is a relatively easy interface for user engagement. While some approaches use text to control music audio generation, editing musical elements in generated audio is challenging for users. In contrast, symbolic music is easy to edit, making it more accessible for users to manipulate specific musical elements. In this paper, we propose MuseCoco, which generates symbolic music from text descriptions, using musical attributes as a bridge to break the task into two stages: text-to-attribute understanding and attribute-to-music generation. MuseCoco stands for Music Composition Copilot; it empowers musicians to generate music directly from text descriptions, offering a significant improvement in efficiency over creating music entirely from scratch. The system has two main advantages. First, it is data efficient. In the attribute-to-music generation stage, attributes can be extracted directly from music sequences, making model training self-supervised. In the text-to-attribute understanding stage, the text is synthesized and refined by ChatGPT based on defined attribute templates. Second, the system achieves precise control through specific attributes in text descriptions and offers multiple control options through attribute-conditioned or text-conditioned approaches. MuseCoco outperforms baseline systems in musicality, controllability, and overall score by at least 1.27, 1.08, and 1.32, respectively. Objective control accuracy also improves notably, by about 20%. In addition, we have developed a robust large-scale model with 1.2 billion parameters that shows exceptional controllability and musicality.
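The two-stage design described above can be illustrated with a minimal sketch. It is not the MuseCoco implementation: the attribute names, the thresholds, the token format, and the keyword matcher standing in for the text-to-attribute model are all assumptions made for illustration. The sketch only shows why attribute labels need no human annotation (they are computed from the notes themselves) and how the same attribute vocabulary can bridge text and music.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Note:
    pitch: int       # MIDI pitch number (0-127)
    start: float     # onset time in beats
    duration: float  # length in beats


def extract_attributes(notes: List[Note], tempo_bpm: float) -> Dict[str, str]:
    """Derive attribute labels directly from a note sequence.

    Thresholds are arbitrary illustrative values, not taken from the paper.
    """
    pitches = [n.pitch for n in notes]
    pitch_range = max(pitches) - min(pitches)
    total_beats = max(n.start + n.duration for n in notes)
    density = len(notes) / total_beats  # notes per beat

    if tempo_bpm < 76:
        tempo = "slow"
    elif tempo_bpm <= 120:
        tempo = "moderate"
    else:
        tempo = "fast"

    return {
        "tempo": tempo,
        "pitch_range": "narrow" if pitch_range <= 12 else "wide",
        "note_density": "sparse" if density < 1.0 else "dense",
    }


def to_prefix_tokens(attrs: Dict[str, str]) -> List[str]:
    """Serialize attributes as conditioning tokens (hypothetical format)."""
    return [f"<{name}={value}>" for name, value in sorted(attrs.items())]


def text_to_attributes(text: str) -> Dict[str, str]:
    """Placeholder for stage 1: a keyword matcher, NOT a trained model."""
    lowered = text.lower()
    attrs: Dict[str, str] = {}
    for value in ("slow", "moderate", "fast"):
        if value in lowered:
            attrs["tempo"] = value
    for value in ("narrow", "wide"):
        if value in lowered:
            attrs["pitch_range"] = value
    return attrs


if __name__ == "__main__":
    melody = [Note(60, 0, 1), Note(64, 1, 1), Note(67, 2, 1), Note(72, 3, 1)]
    # Stage 2 training pair: labels come from the music itself.
    print(to_prefix_tokens(extract_attributes(melody, tempo_bpm=140)))
    # Inference: attributes inferred from text feed the same token format.
    print(to_prefix_tokens(text_to_attributes("A fast piece with a wide range")))
```

Because both stages meet at the same attribute vocabulary, a user could also skip the text step and supply attribute values directly, which corresponds to the attribute-conditioned control option mentioned above.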
