MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies

Abstract

Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to the limited availability of music data and sensitive issues related to copyright and plagiarism. In this paper, to tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts the Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the HiFi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, to address the limitations of the training data and to avoid plagiarism, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, which recombine training audio directly or via a latent embedding space, respectively. Such mixup strategies encourage the model to interpolate between musical training samples and generate new music within the convex hull of the training data, making the generated music more diverse while still staying faithful to the corresponding style. In addition to popular evaluation metrics, we design several new evaluation metrics based on the CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music, as well as the correspondence between input text and generated music.
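
The idea of beat-synchronous audio mixup can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything here is an assumption made for illustration: the two clips are taken to share a tempo already, the first-beat sample indices are taken to come from some external beat tracker (not implemented here), and the mixing weight is drawn from a Beta distribution as in standard mixup. The function names `beat_sync_mixup` and `sample_lambda` are hypothetical.

```python
import random


def align_to_first_beat(wave, first_beat_sample):
    """Drop the samples before the first beat so the clip starts on a beat."""
    return wave[first_beat_sample:]


def beat_sync_mixup(wave_a, beat_a, wave_b, beat_b, lam):
    """Return a convex combination of two beat-aligned clips.

    wave_a, wave_b: lists of float samples (assumed to share a tempo).
    beat_a, beat_b: index of the first beat in each clip (assumed to come
                    from a beat tracker that is not part of this sketch).
    lam:            mixing weight in [0, 1].
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    a = align_to_first_beat(wave_a, beat_a)
    b = align_to_first_beat(wave_b, beat_b)
    n = min(len(a), len(b))  # truncate both clips to the shorter length
    return [lam * x + (1.0 - lam) * y for x, y in zip(a[:n], b[:n])]


def sample_lambda(alpha=5.0, rng=random):
    """Draw a mixing weight from Beta(alpha, alpha); alpha is an assumed value."""
    return rng.betavariate(alpha, alpha)


if __name__ == "__main__":
    clip_a = [0.0, 0.2, 0.4, 0.6, 0.8]
    clip_b = [1.0, 0.5, 0.0, -0.5]
    mixed = beat_sync_mixup(clip_a, 1, clip_b, 0, sample_lambda())
    print(len(mixed), mixed)
```

Because the weight stays in [0, 1], each mixed sample lies between the corresponding samples of the two aligned clips, which is the sense in which such mixup keeps new training examples inside the convex hull of the data. The latent variant mentioned in the abstract would presumably apply the same combination to latent embeddings rather than to waveforms.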
