DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
March 3, 2025
Authors: Ziqian Ning, Huakang Chen, Yuepeng Jiang, Chunbo Hao, Guobin Ma, Shuai Wang, Jixun Yao, Lei Xie
cs.AI
Abstract
Recent advancements in music generation have garnered significant attention,
yet existing approaches face critical limitations. Some current generative
models can only synthesize either the vocal track or the accompaniment track.
While some models can generate combined vocal and accompaniment, they typically
rely on meticulously designed multi-stage cascading architectures and intricate
data pipelines, hindering scalability. Additionally, most systems are
restricted to generating short musical segments rather than full-length songs.
Furthermore, widely used language model-based methods suffer from slow
inference speeds. To address these challenges, we propose DiffRhythm, the first
latent diffusion-based song generation model capable of synthesizing complete
songs with both vocal and accompaniment for durations of up to 4m45s in only
ten seconds, maintaining high musicality and intelligibility. Despite its
remarkable capabilities, DiffRhythm is designed to be simple and elegant: it
eliminates the need for complex data preparation, employs a straightforward
model structure, and requires only lyrics and a style prompt during inference.
Additionally, its non-autoregressive structure ensures fast inference speeds.
This simplicity guarantees the scalability of DiffRhythm. Moreover, we release
the complete training code along with the pre-trained model on large-scale data
to promote reproducibility and further research.