Music Style Transfer with Time-Varying Inversion of Diffusion Models
February 21, 2024
Authors: Sifei Li, Yuxin Zhang, Fan Tang, Chongyang Ma, Weiming Dong, Changsheng Xu
cs.AI
Abstract
With the development of diffusion models, text-guided image style transfer
has demonstrated high-quality controllable synthesis results. However, the
utilization of text for diverse music style transfer poses significant
challenges, primarily due to the limited availability of matched audio-text
datasets. Music, being an abstract and complex art form, exhibits variations
and intricacies even within the same genre, thereby making accurate textual
descriptions challenging. This paper presents a music style transfer approach
that effectively captures musical attributes using minimal data. We introduce a
novel time-varying textual inversion module to precisely capture
mel-spectrogram features at different levels. During inference, we propose a
bias-reduced stylization technique to obtain stable results. Experimental
results demonstrate that our method can transfer the style of specific
instruments, as well as incorporate natural sounds to compose melodies. Samples
and source code are available at https://lsfhuihuiff.github.io/MusicTI/.
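The abstract's key idea is a textual-inversion embedding that varies with the diffusion timestep, so that coarse and fine mel-spectrogram characteristics can be captured at different noise levels. Below is a minimal, hypothetical sketch of such a module in PyTorch; it is not the authors' implementation, and all names (e.g. TimeVaryingTokenEmbedding) are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of a time-varying textual inversion
# module: a pseudo-token embedding conditioned on the diffusion timestep, so
# early (noisy) steps can represent coarse mel-spectrogram structure and late
# steps finer detail. All names here are hypothetical.
import math
import torch
import torch.nn as nn


class TimeVaryingTokenEmbedding(nn.Module):
    """Maps a diffusion timestep t to the embedding of a learnable pseudo-token."""

    def __init__(self, embed_dim: int = 768, time_dim: int = 256):
        super().__init__()
        self.time_dim = time_dim
        # Small MLP from a sinusoidal timestep encoding to the token embedding.
        self.mlp = nn.Sequential(
            nn.Linear(time_dim, 512),
            nn.SiLU(),
            nn.Linear(512, embed_dim),
        )

    def sinusoidal(self, t: torch.Tensor) -> torch.Tensor:
        # Standard sinusoidal encoding of integer timesteps.
        half = self.time_dim // 2
        freqs = torch.exp(
            -math.log(10000.0) * torch.arange(half, device=t.device) / half
        )
        args = t.float()[:, None] * freqs[None, :]
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch,) diffusion timesteps -> (batch, embed_dim) token embeddings.
        return self.mlp(self.sinusoidal(t))


# Usage sketch: only this module would be trained; the text encoder and the
# denoising network of a text-to-mel-spectrogram diffusion model stay frozen.
token_embed = TimeVaryingTokenEmbedding(embed_dim=768)
t = torch.randint(0, 1000, (4,))      # sampled diffusion timesteps
style_token = token_embed(t)          # (4, 768) timestep-dependent embeddings
# These embeddings would replace a placeholder token (e.g. "*") in the prompt
# embeddings before conditioning the frozen model; the usual denoising loss on
# the reference mel-spectrograms updates `token_embed` only.
```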