MuseCoco: Generating Symbolic Music from Text

Abstract

Generating music from text descriptions is a user-friendly mode, since text is a relatively easy interface for users to engage with. While some approaches use text to control music audio generation, editing musical elements in generated audio is challenging for users. In contrast, symbolic music is easy to edit, which makes it more accessible for users to manipulate specific musical elements. In this paper, we propose MuseCoco, which generates symbolic music from text descriptions, using musical attributes as the bridge to break the task down into two stages: text-to-attribute understanding and attribute-to-music generation. MuseCoco stands for Music Composition Copilot. It empowers musicians to generate music directly from given text descriptions, offering a significant improvement in efficiency compared to creating music entirely from scratch. The system has two main advantages. First, it is data efficient. In the attribute-to-music generation stage, the attributes can be extracted directly from music sequences, which makes model training self-supervised. In the text-to-attribute understanding stage, the text is synthesized and refined by ChatGPT based on the defined attribute templates. Second, the system can achieve precise control over specific attributes in text descriptions, and it offers multiple control options through attribute-conditioned or text-conditioned approaches. MuseCoco outperforms baseline systems in terms of musicality, controllability, and overall score by at least 1.27, 1.08, and 1.32, respectively. In addition, objective control accuracy improves by about 20%. We have also developed a robust large-scale model with 1.2 billion parameters, which shows exceptional controllability and musicality.
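To make the self-supervised claim more concrete, the following is a minimal sketch of how attribute labels could be derived directly from a symbolic note sequence and paired with it for training. It is an illustration under stated assumptions, not the authors' implementation: the `Note` type, the two attributes (`pitch_range`, `density`), the density threshold, and the prefix-token layout are all hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Note:
    pitch: int       # MIDI pitch number, 0-127
    start: float     # onset time in beats
    duration: float  # length in beats


def extract_attributes(notes: List[Note]) -> dict:
    """Derive attribute labels from the notes themselves (no human annotation)."""
    pitches = [n.pitch for n in notes]
    # Pitch range in whole octaves (12 semitones per octave).
    pitch_range_octaves = (max(pitches) - min(pitches)) // 12
    # Length of the piece in beats, measured from time zero to the last note end.
    total_beats = max(n.start + n.duration for n in notes)
    density = len(notes) / total_beats
    # Hypothetical threshold, chosen only for illustration.
    density_label = "dense" if density >= 2.0 else "sparse"
    return {"pitch_range": pitch_range_octaves, "density": density_label}


def build_training_sequence(notes: List[Note]) -> List[str]:
    """Pair extracted attributes with the music as one token sequence.

    Assumption: attributes are prepended as prefix tokens; the abstract
    does not specify how the attribute-to-music model is conditioned.
    """
    attrs = extract_attributes(notes)
    prefix = [f"<{name}={value}>" for name, value in sorted(attrs.items())]
    body = [f"note_{n.pitch}" for n in sorted(notes, key=lambda n: n.start)]
    return prefix + ["<sep>"] + body


example = [Note(60, 0, 1), Note(64, 1, 1), Note(67, 2, 1), Note(84, 3, 1)]
print(build_training_sequence(example))
# ['<density=sparse>', '<pitch_range=2>', '<sep>', 'note_60', 'note_64', 'note_67', 'note_84']
```

Because the labels are computed from the music itself, every unlabeled symbolic piece yields a training pair at no extra cost. In this framing, the text-to-attribute stage would only need to map a text description onto the same attribute vocabulary.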
