MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies
August 3, 2023
Authors: Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov
cs.AI
Abstract
Diffusion models have shown promising results in cross-modal generation
tasks, including text-to-image and text-to-audio generation. However,
generating music, as a special type of audio, presents unique challenges due to
the limited availability of music data and to sensitive issues related to copyright
and plagiarism. In this paper, to tackle these challenges, we first construct a
state-of-the-art text-to-music model, MusicLDM, that adapts Stable Diffusion
and AudioLDM architectures to the music domain. We achieve this by retraining
the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN
vocoder, as components of MusicLDM, on a collection of music data samples.
Then, to address the limitations of training data and to avoid plagiarism, we
leverage a beat tracking model and propose two different mixup strategies for
data augmentation: beat-synchronous audio mixup and beat-synchronous latent
mixup, which recombine training audio directly or via a latent embedding
space, respectively. Such mixup strategies encourage the model to interpolate
between musical training samples and generate new music within the convex hull
of the training data, making the generated music more diverse while still
staying faithful to the corresponding style. In addition to popular evaluation
metrics, we design several new evaluation metrics based on CLAP score to
demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies
improve both the quality and novelty of generated music, as well as the
correspondence between input text and generated music.
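The core of the proposed data augmentation is mixup applied to beat-aligned clips: two training samples are shifted so their downbeats coincide and then linearly interpolated. A minimal sketch of the audio variant is below, assuming the downbeat positions are already provided by a beat tracking model (the helper name `beat_synchronous_mixup` and the fixed `lam` parameter are illustrative, not from the paper, which samples the mixing weight and also defines a latent-space variant):

```python
import numpy as np

def beat_synchronous_mixup(x1, x2, beat1, beat2, lam=0.5):
    """Sketch of beat-synchronous audio mixup.

    x1, x2:       1-D waveform arrays at the same sample rate.
    beat1, beat2: sample index of the first downbeat in each clip
                  (in the paper these come from a beat tracking model;
                  here they are assumed to be given).
    lam:          mixup interpolation weight in [0, 1].
    """
    # Shift x2 so its first downbeat lines up with x1's downbeat.
    offset = beat1 - beat2
    if offset >= 0:
        x2_aligned = np.concatenate([np.zeros(offset), x2])
    else:
        x2_aligned = x2[-offset:]
    # Trim both clips to a common length, then apply vanilla mixup:
    # a convex combination of the two aligned waveforms.
    n = min(len(x1), len(x2_aligned))
    return lam * x1[:n] + (1.0 - lam) * x2_aligned[:n]
```

The latent variant described in the abstract would apply the same convex combination to the latent embeddings of the two clips instead of their waveforms; in both cases the generated mixtures stay within the convex hull of the training data.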