Long-form music generation with latent diffusion

April 16, 2024
Authors: Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
cs.AI

Abstract

Audio-based generative models for music have seen great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure. We show that by training a generative model on long temporal contexts it is possible to produce long-form music of up to 4m45s. Our model consists of a diffusion-transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5Hz). It obtains state-of-the-art generations according to metrics on audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
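
The figures quoted in the abstract imply a fairly short sequence for the diffusion-transformer to model. The following is a minimal back-of-the-envelope sketch of that arithmetic, not the authors' code. The 21.5 Hz latent rate and the 4m45s duration come from the abstract; the 44.1 kHz audio sample rate is an assumption made here for illustration only, and the function names are hypothetical.

```python
# Sequence-length arithmetic for the figures quoted in the abstract.

LATENT_RATE_HZ = 21.5             # latent rate, from the abstract
MAX_DURATION_S = 4 * 60 + 45      # 4m45s, from the abstract (285 seconds)
ASSUMED_SAMPLE_RATE_HZ = 44_100   # ASSUMPTION: not stated in the abstract


def latent_frames(duration_s: float, latent_rate_hz: float = LATENT_RATE_HZ) -> float:
    """Number of latent frames needed to cover `duration_s` seconds."""
    return duration_s * latent_rate_hz


def samples_per_latent(sample_rate_hz: float, latent_rate_hz: float = LATENT_RATE_HZ) -> float:
    """Raw audio samples (per channel) summarized by one latent frame."""
    return sample_rate_hz / latent_rate_hz


frames = latent_frames(MAX_DURATION_S)
raw_samples = MAX_DURATION_S * ASSUMED_SAMPLE_RATE_HZ
ratio = samples_per_latent(ASSUMED_SAMPLE_RATE_HZ)

print(f"duration:            {MAX_DURATION_S} s")
print(f"latent frames:       {frames:.1f}")
print(f"raw samples/channel: {raw_samples} (assumed {ASSUMED_SAMPLE_RATE_HZ} Hz)")
print(f"samples per latent:  {ratio:.1f}")
```

Under these assumptions, a full-length track is roughly six thousand latent frames rather than millions of raw samples per channel, which is what makes training on long temporal contexts tractable.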
