Fast Timing-Conditioned Latent Audio Diffusion

Abstract

Generating long-form 44.1kHz stereo audio from text prompts can be computationally demanding. Further, most previous work does not address the fact that music and sound effects naturally vary in duration. Our research focuses on the efficient generation of long-form, variable-length stereo music and sounds at 44.1kHz from text prompts with a generative model. Stable Audio is based on latent diffusion, with its latent space defined by a fully-convolutional variational autoencoder. It is conditioned on text prompts as well as timing embeddings, allowing fine control over both the content and the length of the generated music and sounds. Stable Audio can render stereo signals of up to 95 sec at 44.1kHz in 8 sec on an A100 GPU. Despite its compute efficiency and fast inference, it ranks among the best on two public text-to-music and text-to-audio benchmarks and, unlike state-of-the-art models, can generate music with structure and stereo sounds.
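To put the quoted rendering figures in perspective, the following minimal sketch works out the implied output size and speed-up over real time. It uses only the numbers stated in the abstract (95 sec, 44.1kHz, stereo, 8 sec); the function name `audio_budget` and its parameters are illustrative assumptions and are not part of Stable Audio.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the abstract.
# Nothing here calls Stable Audio; `audio_budget` is a hypothetical helper.

SAMPLE_RATE_HZ = 44_100   # 44.1kHz, as stated in the abstract
CHANNELS = 2              # stereo
MAX_DURATION_S = 95       # longest signal quoted in the abstract
RENDER_TIME_S = 8         # quoted rendering time on an A100 GPU


def audio_budget(duration_s=MAX_DURATION_S, render_s=RENDER_TIME_S,
                 sample_rate=SAMPLE_RATE_HZ, channels=CHANNELS):
    """Return frame count, total sample count, and speed-up over real time."""
    frames = duration_s * sample_rate      # time steps per channel
    samples = frames * channels            # values across both channels
    speedup = duration_s / render_s        # seconds of audio per second of compute
    return {"frames": frames, "samples": samples, "speedup": speedup}


budget = audio_budget()
print(budget)
```

Under these assumptions, a 95 sec stereo clip holds 4,189,500 frames (8,379,000 samples in total), and rendering it in 8 sec corresponds to roughly 11.9 times faster than real time.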

