Music Consistency Models
April 20, 2024
Authors: Zhengcong Fei, Mingyuan Fan, Junshi Huang
cs.AI
Abstract
Consistency models have exhibited remarkable capabilities in facilitating
efficient image/video generation, enabling synthesis with minimal sampling
steps. This has proven advantageous in mitigating the computational burdens
associated with diffusion models. Nevertheless, the application of consistency
models in music generation remains largely unexplored. To address this gap, we
present Music Consistency Models (MusicCM), which leverages the
concept of consistency models to efficiently synthesize mel-spectrograms for
music clips, maintaining high quality while minimizing the number of sampling
steps. Building upon existing text-to-music diffusion models, the
MusicCM model incorporates consistency distillation and adversarial
discriminator training. Moreover, we find it beneficial to generate extended
coherent music by incorporating multiple diffusion processes with shared
constraints. Experimental results reveal the effectiveness of our model in
terms of computational efficiency, fidelity, and naturalness. Notably,
MusicCM achieves seamless music synthesis with a mere four sampling
steps, e.g., only one second of computation per minute of generated music,
showcasing the potential for real-time application.
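To make the consistency-distillation step concrete, below is a minimal sketch of the standard objective (after Song et al., 2023) as it would apply to mel-spectrograms. The module names (`teacher`, `student`, `ema_student`), the text-conditioning interface, and the variance-exploding parameterization x_t = x_0 + t · noise are illustrative assumptions, not the authors' released code; the paper's additional adversarial discriminator term is omitted here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_ode_step(teacher, x, t_from, t_to, cond):
    """One Euler step of the probability-flow ODE from t_from down to t_to."""
    x0_pred = teacher(x, t_from, cond)      # teacher's denoised mel estimate
    d = (x - x0_pred) / t_from              # ODE drift under x_t = x_0 + t * noise
    return x + (t_to - t_from) * d

def consistency_distillation_loss(student, ema_student, teacher,
                                  x0, cond, t_from, t_to):
    """Self-consistency: the student's outputs at adjacent points on the
    teacher's ODE trajectory should agree (t_to < t_from)."""
    noise = torch.randn_like(x0)
    x_from = x0 + t_from * noise            # diffuse the clean mel to time t_from
    x_to = teacher_ode_step(teacher, x_from, t_from, t_to, cond)
    with torch.no_grad():
        target = ema_student(x_to, t_to, cond)   # EMA copy provides a stable target
    pred = student(x_from, t_from, cond)
    return F.mse_loss(pred, target)
```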
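The four-sampling-step claim corresponds to multistep consistency sampling, sketched below under the same assumptions: each step jumps directly to a clean mel-spectrogram estimate and then partially re-noises it to a lower noise level. The sigma schedule shown is an illustrative choice, not the paper's exact one.

```python
import torch

@torch.no_grad()
def sample_mel(consistency_fn, cond, shape, sigmas=(80.0, 24.0, 5.0, 0.5)):
    """Multistep consistency sampling with four network evaluations."""
    x = torch.randn(shape) * sigmas[0]           # start from pure noise
    mel = consistency_fn(x, sigmas[0], cond)     # one jump to a clean estimate
    for sigma in sigmas[1:]:
        x = mel + sigma * torch.randn(shape)     # partially re-noise the estimate
        mel = consistency_fn(x, sigma, cond)     # refine with another jump
    return mel
```

In a typical text-to-music pipeline, the resulting mel-spectrogram would then be rendered to a waveform by a separately trained vocoder.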