Music Style Transfer with Time-Varying Inversion of Diffusion Models

Abstract

With the development of diffusion models, text-guided image style transfer has demonstrated high-quality controllable synthesis results. However, the utilization of text for diverse music style transfer poses significant challenges, primarily due to the limited availability of matched audio-text datasets. Music, being an abstract and complex art form, exhibits variations and intricacies even within the same genre, thereby making accurate textual descriptions challenging. This paper presents a music style transfer approach that effectively captures musical attributes using minimal data. We introduce a novel time-varying textual inversion module to precisely capture mel-spectrogram features at different levels. During inference, we propose a bias-reduced stylization technique to obtain stable results. Experimental results demonstrate that our method can transfer the style of specific instruments, as well as incorporate natural sounds to compose melodies. Samples and source code are available at https://lsfhuihuiff.github.io/MusicTI/.
