Fast Timing-Conditioned Latent Audio Diffusion

Abstract

Generating long-form 44.1kHz stereo audio from text prompts can be computationally demanding. Further, most previous work does not address the fact that music and sound effects naturally vary in duration. Our research focuses on the efficient generation of long-form, variable-length stereo music and sounds at 44.1kHz from text prompts with a generative model. Stable Audio is based on latent diffusion, with its latent space defined by a fully-convolutional variational autoencoder. It is conditioned on text prompts as well as timing embeddings, allowing fine control over both the content and the length of the generated music and sounds. Stable Audio can render up to 95 sec of 44.1kHz stereo audio in 8 sec on an A100 GPU. Despite its compute efficiency and fast inference, it is one of the best in two public text-to-music and text-to-audio benchmarks and, unlike state-of-the-art models, can generate music with structure and stereo sounds.
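To put the reported inference speed in perspective, the following minimal sketch works out the arithmetic implied by the figures in the abstract (95 sec of 44.1kHz stereo audio rendered in 8 sec). The function and constant names are illustrative assumptions and are not part of Stable Audio; only the four numbers come from the abstract.

```python
# Figures quoted in the abstract; all names below are hypothetical.
SAMPLE_RATE_HZ = 44_100   # 44.1kHz
CHANNELS = 2              # stereo
AUDIO_SECONDS = 95        # maximum length of generated audio
RENDER_SECONDS = 8        # reported render time on an A100 GPU


def samples_per_channel(duration_s: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of waveform samples in one channel for the given duration."""
    return duration_s * sample_rate_hz


def real_time_factor(audio_s: float, render_s: float) -> float:
    """Seconds of audio produced per second of compute."""
    return audio_s / render_s


frames = samples_per_channel(AUDIO_SECONDS)
total_samples = frames * CHANNELS
rtf = real_time_factor(AUDIO_SECONDS, RENDER_SECONDS)

print(f"samples per channel: {frames}")
print(f"total stereo samples: {total_samples}")
print(f"real-time factor: {rtf}")
```

Under these figures, a single 95 sec clip corresponds to roughly 4.2 million samples per channel, and generation runs at close to 12 times real time.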
