Long-form music generation with latent diffusion
April 16, 2024
Authors: Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
cs.AI
Abstract
Audio-based generative models for music have seen great strides recently, but
so far have not managed to produce full-length music tracks with coherent
musical structure. We show that by training a generative model on long temporal
contexts it is possible to produce long-form music of up to 4m45s. Our model
consists of a diffusion-transformer operating on a highly downsampled
continuous latent representation (latent rate of 21.5Hz). It obtains
state-of-the-art generations according to metrics on audio quality and prompt
alignment, and subjective tests reveal that it produces full-length music with
coherent structure.
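
To give a feel for the sequence lengths these figures imply, here is a rough back-of-the-envelope sketch. It is not taken from the paper: it only multiplies the two numbers stated in the abstract (a maximum duration of 4m45s and a latent rate of 21.5 Hz), and the function name `latent_frames` is a hypothetical helper introduced for illustration.

```python
def latent_frames(duration_s: float, latent_rate_hz: float) -> float:
    """Number of latent frames covering `duration_s` seconds of audio.

    Hypothetical helper for illustration only; it is plain arithmetic,
    not code from the paper.
    """
    return duration_s * latent_rate_hz


# Figures quoted in the abstract.
max_duration_s = 4 * 60 + 45   # 4m45s = 285 seconds
latent_rate_hz = 21.5          # latent frames per second

frames = latent_frames(max_duration_s, latent_rate_hz)
print(frames)  # 6127.5
```

Under these assumptions, a full-length track corresponds to roughly six thousand latent frames, which suggests why heavy downsampling helps keep long temporal contexts within reach of a transformer.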