GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework

Abstract

Symbolic music generation aims to create musical notes that help users compose music, for example by generating target instrumental tracks from scratch or from user-provided source tracks. Because source and target tracks can be combined in many flexible ways, a unified model capable of generating arbitrary tracks is essential. Previous works fail to meet this need because of inherent constraints in their music representations and model architectures. To address it, we propose GETMusic ("GET" stands for GEnerate music Tracks), a unified representation and diffusion framework comprising a novel music representation named GETScore and a diffusion model named GETDiff. GETScore represents notes as tokens and organizes them in a 2D structure, with tracks stacked vertically and time progressing horizontally. During training, tracks are randomly selected as either target or source. In the forward process, target tracks are corrupted by masking their tokens, while source tracks remain as ground truth. In the denoising process, GETDiff learns to predict the masked target tokens conditioned on the source tracks. With separate tracks in GETScore and the non-autoregressive behavior of the model, GETMusic can explicitly control the generation of any target tracks, either from scratch or conditioned on source tracks. We conduct music generation experiments involving six instrumental tracks, covering a total of 665 combinations. GETMusic delivers high-quality results across diverse combinations and surpasses prior works proposed for some specific combinations.
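
As a rough illustration of the track-wise masking and of the figure 665, the sketch below builds a toy track-by-time token grid, masks the target rows, and counts role assignments. It is not the authors' implementation. The `MASK` symbol, the helper names `corrupt` and `count_combinations`, the toy token values, and the one-shot (rather than gradual) masking are all assumptions made for illustration. The counting rule is also an assumption: each track is taken to be a source, a target, or unused, with at least one target required. That rule happens to reproduce the reported total.

```python
from itertools import product

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def corrupt(score, targets):
    """Mask every token of each target track; copy all other tracks unchanged.

    `score` is a list of rows (one per track), each a list of tokens over time.
    `targets` is a set of row indices chosen as target tracks.
    This masks everything in one step, a simplification of a forward process.
    """
    return [
        [MASK] * len(row) if i in targets else list(row)
        for i, row in enumerate(score)
    ]


def count_combinations(n_tracks):
    """Count role assignments under the assumed rule: each track is a source,
    a target, or unused, and at least one track must be a target."""
    return sum(
        1
        for roles in product(("source", "target", "unused"), repeat=n_tracks)
        if "target" in roles
    )


# Toy 3-track, 4-step grid with made-up integer tokens.
score = [
    [60, 62, 64, 65],  # track 0
    [36, 36, 43, 43],  # track 1
    [0, 1, 0, 1],      # track 2
]

corrupted = corrupt(score, targets={1})  # track 1 is the target, others are sources
total = count_combinations(6)            # 3**6 - 2**6 under the assumed rule
```

Here `corrupted` keeps tracks 0 and 2 intact and replaces every token of track 1 with `MASK`, which is the input a denoiser would be asked to fill in. With six tracks, `total` equals 3^6 - 2^6 = 665, matching the number of combinations stated in the abstract.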
