Efficient Neural Music Generation

May 25, 2023
Authors: Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang
cs.AI

Abstract

Music generation has advanced remarkably with the state-of-the-art MusicLM, which comprises a hierarchy of three LMs for semantic, coarse acoustic, and fine acoustic modeling, respectively. Yet sampling with MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes in MusicLM by 95.7% or 99.6% for sampling 10 s or 30 s of music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into a waveform. DPD simultaneously models the coarse and fine acoustics by incorporating the semantic information into segments of latents via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages in sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.
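
The abstract says that semantic information is incorporated into segments of latents via cross-attention at each denoising step. The following is a minimal sketch of that general idea only, not the authors' DPD implementation: every name, dimension, the random weights, and the placeholder residual update are assumptions made for illustration. It shows latent segments acting as queries over semantic-token embeddings, repeated over a few steps.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latents, semantic, w_q, w_k, w_v):
    """Latent segments (queries) attend to semantic tokens (keys and values)."""
    q = matmul(latents, w_q)    # (segments, d_attn)
    k = matmul(semantic, w_k)   # (tokens, d_attn)
    v = matmul(semantic, w_v)   # (tokens, d_latent)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # transpose of k
    scores = matmul(q, k_t)                       # (segments, tokens)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# All sizes below are hypothetical; the abstract does not specify them.
rng = random.Random(0)
NUM_SEGMENTS, NUM_TOKENS = 4, 6
D_LATENT, D_SEM, D_ATTN = 8, 16, 8

latents = random_matrix(NUM_SEGMENTS, D_LATENT, rng)   # noisy latent segments
semantic = random_matrix(NUM_TOKENS, D_SEM, rng)       # semantic-token embeddings
w_q = random_matrix(D_LATENT, D_ATTN, rng)
w_k = random_matrix(D_SEM, D_ATTN, rng)
w_v = random_matrix(D_SEM, D_LATENT, rng)

NUM_STEPS = 3
for step in range(NUM_STEPS):
    context, weights = cross_attention(latents, semantic, w_q, w_k, w_v)
    # Placeholder residual update standing in for a real denoising network.
    latents = [
        [x + 0.1 * c for x, c in zip(lat_row, ctx_row)]
        for lat_row, ctx_row in zip(latents, context)
    ]
```

In this sketch each row of `weights` is a distribution over the semantic tokens, so every latent segment receives a weighted mix of semantic information at every step. A real denoiser would learn the projections and predict noise rather than apply a fixed residual.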