Music Style Transfer with Time-Varying Inversion of Diffusion Models

Abstract

With the development of diffusion models, text-guided image style transfer has demonstrated high-quality, controllable synthesis results. However, using text for diverse music style transfer poses significant challenges, primarily because matched audio-text datasets are scarce. Music is an abstract and complex art form that exhibits variations and intricacies even within the same genre, which makes accurate textual descriptions difficult. This paper presents a music style transfer approach that effectively captures musical attributes using minimal data. We introduce a novel time-varying textual inversion module to precisely capture mel-spectrogram features at different levels. During inference, we propose a bias-reduced stylization technique to obtain stable results. Experimental results demonstrate that our method can transfer the style of specific instruments and can incorporate natural sounds to compose melodies. Samples and source code are available at https://lsfhuihuiff.github.io/MusicTI/.
