MuseCoco: Generating Symbolic Music from Text

Abstract

Generating music from text descriptions is user-friendly, since text is a relatively easy interface for user engagement. While some approaches use text to control music audio generation, editing musical elements in generated audio is difficult for users. Symbolic music, in contrast, is easy to edit, which makes it more accessible for users who want to manipulate specific musical elements. In this paper, we propose MuseCoco, which generates symbolic music from text descriptions, using musical attributes as a bridge to break the task down into two stages: text-to-attribute understanding and attribute-to-music generation. MuseCoco stands for Music Composition Copilot; it lets musicians generate music directly from a given text description, which is significantly more efficient than creating music entirely from scratch. The system has two main advantages. First, it is data efficient. In the attribute-to-music generation stage, attributes can be extracted directly from music sequences, so model training is self-supervised. In the text-to-attribute understanding stage, the text is synthesized and refined by ChatGPT based on defined attribute templates. Second, the system achieves precise control through specific attributes in the text description and offers multiple control options through attribute-conditioned or text-conditioned approaches. MuseCoco outperforms baseline systems in musicality, controllability, and overall score by at least 1.27, 1.08, and 1.32 points, respectively. In addition, objective control accuracy improves notably, by about 20%. We have also developed a robust large-scale model with 1.2 billion parameters, which shows exceptional controllability and musicality.
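
The data-efficiency argument in the abstract can be illustrated with a small sketch. It shows how training pairs for both stages might be built without human labels: attributes are computed from the notes themselves, and a text description is rendered from those attributes. Everything named here is an assumption made for illustration and is not taken from the paper: the `Note` class, the two attributes (`tempo`, `pitch_range_octaves`), the tempo thresholds, and the fixed sentence template (the abstract states that ChatGPT synthesizes and refines the text from attribute templates, which this sketch does not attempt).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

AttrValue = Union[str, int]


@dataclass(frozen=True)
class Note:
    """Hypothetical minimal note representation (not the paper's encoding)."""
    pitch: int       # MIDI pitch number, 0-127
    start: float     # onset time in beats
    duration: float  # length in beats


def extract_attributes(notes: List[Note], tempo_bpm: float) -> Dict[str, AttrValue]:
    """Derive attribute labels directly from the music, with no human labeling."""
    if not notes:
        raise ValueError("need at least one note")
    pitches = [n.pitch for n in notes]
    span_semitones = max(pitches) - min(pitches)
    # Illustrative thresholds only; the paper's attribute definitions may differ.
    if tempo_bpm < 76:
        tempo = "slow"
    elif tempo_bpm <= 120:
        tempo = "moderate"
    else:
        tempo = "fast"
    return {"tempo": tempo, "pitch_range_octaves": span_semitones // 12}


def render_text(attrs: Dict[str, AttrValue]) -> str:
    """Fill a fixed template; a stand-in for LLM-synthesized descriptions."""
    return (
        f"A {attrs['tempo']} piece spanning about "
        f"{attrs['pitch_range_octaves']} octave(s)."
    )


def make_training_pairs(
    notes: List[Note], tempo_bpm: float
) -> Tuple[Tuple[str, Dict[str, AttrValue]], Tuple[Dict[str, AttrValue], List[Note]]]:
    """Build one (text, attributes) pair and one (attributes, music) pair."""
    attrs = extract_attributes(notes, tempo_bpm)
    text_to_attribute = (render_text(attrs), attrs)  # stage 1 training example
    attribute_to_music = (attrs, notes)              # stage 2 training example
    return text_to_attribute, attribute_to_music


if __name__ == "__main__":
    melody = [Note(60, 0.0, 1.0), Note(72, 1.0, 1.0), Note(79, 2.0, 1.0)]
    stage1, stage2 = make_training_pairs(melody, tempo_bpm=140.0)
    print(stage1[0])
    print(stage2[0])
```

The point of the sketch is that the attribute dictionary is the only interface between the two stages, so each stage can be trained on its own data: the second stage needs only unlabeled music, and the first needs only text paired with attribute values.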
