MuseCoco: Generating Symbolic Music from Text

Abstract

Generating music from text descriptions is user-friendly because text is a relatively easy interface for users to engage with. Some approaches use text to control music audio generation, but editing musical elements in generated audio is difficult for users. Symbolic music, in contrast, is easy to edit, which makes it more accessible for users who want to manipulate specific musical elements. In this paper, we propose MuseCoco, which generates symbolic music from text descriptions, using musical attributes as a bridge to break the task down into two stages: text-to-attribute understanding and attribute-to-music generation. MuseCoco stands for Music Composition Copilot; it enables musicians to generate music directly from given text descriptions, which is significantly more efficient than creating music entirely from scratch. The system has two main advantages. First, it is data efficient. In the attribute-to-music generation stage, the attributes can be extracted directly from music sequences, which makes the model training self-supervised. In the text-to-attribute understanding stage, the text is synthesized and refined by ChatGPT based on the defined attribute templates. Second, the system achieves precise control through specific attributes in text descriptions and offers multiple control options through attribute-conditioned or text-conditioned approaches. MuseCoco outperforms baseline systems in musicality, controllability, and overall score by at least 1.27, 1.08, and 1.32, respectively, and improves objective control accuracy by about 20%. In addition, we have developed a robust large-scale model with 1.2 billion parameters that shows exceptional controllability and musicality.
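To make the data-efficiency argument concrete, the following is a minimal sketch of how attributes might be derived directly from a note sequence and then turned into text through a template. It is an illustration under stated assumptions, not MuseCoco's actual implementation: the `Note` class, the three attributes (`num_bars`, `pitch_range`, `notes_per_bar`), the 4-beat bar, and the template wording are all hypothetical. The ChatGPT refinement of the templated text is not shown.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    """Hypothetical symbolic note; not MuseCoco's actual representation."""
    pitch: int       # MIDI pitch number
    start: float     # onset time in beats
    duration: float  # length in beats


# Hypothetical attribute template; the abstract does not give the real ones.
TEMPLATE = (
    "A {num_bars}-bar piece spanning {pitch_range} semitones "
    "with about {notes_per_bar:.1f} notes per bar."
)


def extract_attributes(notes, beats_per_bar=4):
    """Compute coarse attributes from the notes alone (no human labels).

    Because the labels come from the music itself, (attributes, notes) pairs
    can serve as self-supervised training data for an attribute-to-music model.
    """
    if not notes:
        raise ValueError("need at least one note")
    end = max(n.start + n.duration for n in notes)
    num_bars = max(1, math.ceil(end / beats_per_bar))
    pitches = [n.pitch for n in notes]
    return {
        "num_bars": num_bars,
        "pitch_range": max(pitches) - min(pitches),
        "notes_per_bar": len(notes) / num_bars,
    }


def describe(attributes):
    """Fill the template to get a synthetic text paired with the attributes.

    Such (text, attributes) pairs could train a text-to-attribute model.
    """
    return TEMPLATE.format(**attributes)
```

Calling `extract_attributes` on a list of `Note` objects and passing the result to `describe` yields one (text, attributes) pair for the first stage and one (attributes, notes) pair for the second, without any manually annotated text.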
