Efficient Neural Music Generation

Abstract

Recent progress in music generation has been driven largely by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs for semantic, coarse acoustic, and fine acoustic modeling, respectively. Yet sampling with MusicLM requires passing through these LMs one by one to obtain the fine-grained acoustic tokens, which makes it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes in MusicLM by 95.7% or 99.6% when sampling 10s or 30s of music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into a waveform. DPD is proposed to model the coarse and fine acoustics simultaneously by incorporating the semantic information into segments of latents via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages in sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.
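To put the reported reductions in perspective, the short sketch below converts a percentage reduction in forward passes into a "times fewer passes" factor. It assumes only that the percentages are relative to MusicLM's forward-pass count, as stated above. The function name `fewer_passes_factor` is ours, and the result is a ratio of forward-pass counts, not a measured wall-clock speedup.

```python
def fewer_passes_factor(reduction_percent: float) -> float:
    """Return how many times fewer forward passes remain after a percentage reduction.

    Illustrative helper (not from the paper): a reduction of r percent leaves
    a fraction (1 - r/100) of the original passes, so the factor is its inverse.
    """
    remaining_fraction = 1.0 - reduction_percent / 100.0
    if remaining_fraction <= 0.0:
        raise ValueError("reduction must be below 100%")
    return 1.0 / remaining_fraction


# Reductions quoted in the abstract for 10s and 30s samples.
for seconds, reduction in [(10, 95.7), (30, 99.6)]:
    factor = fewer_passes_factor(reduction)
    print(f"{seconds}s sample: {reduction}% reduction, about {factor:.1f}x fewer forward passes")
```

Under this reading, the 10s case corresponds to roughly 23 times fewer forward passes and the 30s case to roughly 250 times fewer, so the relative saving grows with the length of the sample.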
