

MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies

August 3, 2023
作者: Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov
cs.AI

Abstract

Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to limited availability of music data and sensitive issues related to copyright and plagiarism. In this paper, to tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, to address the limitations of training data and to avoid plagiarism, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, which recombine training audio directly or via a latent embeddings space, respectively. Such mixup strategies encourage the model to interpolate between musical training samples and generate new music within the convex hull of the training data, making the generated music more diverse while still staying faithful to the corresponding style. In addition to popular evaluation metrics, we design several new evaluation metrics based on CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music, as well as the correspondence between input text and generated music.
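The abstract does not include code, but the core idea of beat-synchronous mixup can be illustrated with a minimal sketch. The function name `beat_sync_mixup`, the precomputed beat positions, and the Beta-distributed mixing weight are assumptions for illustration, not the authors' implementation; the same interpolation applied to latent embeddings rather than waveforms would correspond to the paper's latent-mixup variant.

```python
import numpy as np

def beat_sync_mixup(x1, x2, beat1_idx, beat2_idx, lam=0.5):
    """Sketch of beat-synchronous audio mixup (hypothetical helper).

    Shift x2 so that one of its beat positions (sample index beat2_idx)
    lines up with a beat position of x1 (beat1_idx), then linearly
    interpolate the two waveforms: lam * x1 + (1 - lam) * x2.
    """
    shift = beat1_idx - beat2_idx        # samples to shift x2 for alignment
    x2_aligned = np.roll(x2, shift)      # circular shift; a real pipeline
                                         # would pad or crop instead
    n = min(len(x1), len(x2_aligned))    # mix over the overlapping region
    return lam * x1[:n] + (1.0 - lam) * x2_aligned[:n]

# In mixup-style augmentation the weight is often drawn from a Beta
# distribution, e.g. lam = np.random.beta(0.5, 0.5); beat positions
# would come from a beat tracking model as described in the paper.
```

Applying the same convex combination to CLAP/VAE latent embeddings instead of raw audio keeps the interpolation inside the model's representation space, which is the distinction the abstract draws between the two strategies.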