Long-form music generation with latent diffusion

Abstract

Audio-based generative models for music have seen great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure. We show that by training a generative model on long temporal contexts it is possible to produce long-form music of up to 4m45s. Our model consists of a diffusion-transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5Hz). It obtains state-of-the-art generations according to metrics on audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
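
As a rough illustration of what the quoted latent rate implies for sequence length, the following minimal sketch computes how many latent frames are needed to cover the longest piece mentioned above. Only the 21.5 Hz latent rate and the 4m45s duration come from the abstract; the function names and the 44.1 kHz sample rate are assumptions made purely for illustration.

```python
import math

LATENT_RATE_HZ = 21.5          # latent rate quoted in the abstract
MAX_DURATION_S = 4 * 60 + 45   # 4m45s expressed in seconds (285)


def latent_frames(duration_s: float, latent_rate_hz: float = LATENT_RATE_HZ) -> int:
    """Latent frames needed to cover duration_s seconds, rounded up."""
    return math.ceil(duration_s * latent_rate_hz)


def downsampling_factor(sample_rate_hz: float, latent_rate_hz: float = LATENT_RATE_HZ) -> float:
    """Ratio of audio samples to latent frames for a given sample rate."""
    return sample_rate_hz / latent_rate_hz


# Sequence length the model would attend over for a full-length piece.
frames = latent_frames(MAX_DURATION_S)

# ASSUMPTION: 44.1 kHz audio; the abstract does not state the sample rate.
ASSUMED_SAMPLE_RATE_HZ = 44_100
factor = downsampling_factor(ASSUMED_SAMPLE_RATE_HZ)

print(f"{MAX_DURATION_S} s at {LATENT_RATE_HZ} Hz -> {frames} latent frames")
print(f"approximate downsampling factor at 44.1 kHz: {factor:.0f}x")
```

Under these assumptions, a full 285-second piece corresponds to a sequence of roughly six thousand latent frames, compared with over twelve million raw samples per channel at the assumed sample rate, which is why a highly downsampled latent makes long temporal contexts tractable.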
