

Fast Timing-Conditioned Latent Audio Diffusion

February 7, 2024
Authors: Zach Evans, CJ Carr, Josiah Taylor, Scott H. Hawley, Jordi Pons
cs.AI

Abstract

Generating long-form 44.1kHz stereo audio from text prompts can be computationally demanding. Further, most previous works do not address the fact that music and sound effects naturally vary in duration. Our research focuses on the efficient generation of long-form, variable-length stereo music and sounds at 44.1kHz from text prompts with a generative model. Stable Audio is based on latent diffusion, with its latent defined by a fully-convolutional variational autoencoder. It is conditioned on text prompts as well as timing embeddings, allowing fine control over both the content and the length of the generated music and sounds. Stable Audio can render stereo signals of up to 95 sec at 44.1kHz in 8 sec on an A100 GPU. Despite its compute efficiency and fast inference, it ranks among the best models on two public text-to-music and text-to-audio benchmarks and, unlike state-of-the-art models, can generate music with structure and stereo sounds.
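The timing-conditioning idea in the abstract can be sketched as follows: the start offset and total length of the desired audio are mapped to embeddings and appended to the text-prompt features that condition the diffusion model. This is a minimal illustration with hypothetical names and random stand-ins for learned embedding tables, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 768      # assumed conditioning width
MAX_SECONDS = 512    # assumed upper bound on supported durations

# Stand-ins for learned per-second embedding tables.
start_table = rng.standard_normal((MAX_SECONDS, EMBED_DIM))
total_table = rng.standard_normal((MAX_SECONDS, EMBED_DIM))

def timing_embeddings(seconds_start: int, seconds_total: int) -> np.ndarray:
    """Look up embeddings for the chunk's start offset and total length."""
    return np.stack([start_table[seconds_start], total_table[seconds_total]])

def build_conditioning(text_features: np.ndarray,
                       seconds_start: int,
                       seconds_total: int) -> np.ndarray:
    """Append the two timing tokens to the text tokens along the sequence axis."""
    timing = timing_embeddings(seconds_start, seconds_total)
    return np.concatenate([text_features, timing], axis=0)

text = rng.standard_normal((77, EMBED_DIM))  # e.g. 77 text-encoder tokens
cond = build_conditioning(text, seconds_start=0, seconds_total=95)
print(cond.shape)  # (79, 768): 77 text tokens plus 2 timing tokens
```

In this view, changing `seconds_total` steers the model toward generating audio of the requested length, while `seconds_start` lets it generate a chunk from anywhere within a longer piece.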