Stable Audio 3

Abstract

Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Because our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing a full-length output for short sounds. We also support inpainting, which enables targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data, and generate music and sounds in less than 2 s on an H200 GPU and within a few seconds on a MacBook Pro M4. We release the weights of small and medium, which can run on consumer-grade hardware, together with their training and inference pipeline.
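To make the variable-length and inpainting ideas concrete, the sketch below shows how a requested duration could map to a number of latent frames, and how a time span could be marked for regeneration. It is an illustration under stated assumptions, not the released pipeline: the abstract does not give the latent frame rate or the mask convention, so `LATENT_RATE_HZ`, `latent_frames`, and `inpaint_mask` are hypothetical names and values.

```python
import math

# Hypothetical latent frame rate; the abstract does not state the real value.
LATENT_RATE_HZ = 10.0


def latent_frames(duration_s: float, latent_rate_hz: float = LATENT_RATE_HZ) -> int:
    """Number of latent frames needed to cover duration_s seconds of audio."""
    return math.ceil(duration_s * latent_rate_hz)


def inpaint_mask(total_frames: int, start_s: float, end_s: float,
                 latent_rate_hz: float = LATENT_RATE_HZ) -> list[int]:
    """Assumed convention: 1 keeps the original latent frame, 0 regenerates it."""
    lo = max(0, math.floor(start_s * latent_rate_hz))
    hi = min(total_frames, math.ceil(end_s * latent_rate_hz))
    return [0 if lo <= i < hi else 1 for i in range(total_frames)]


# A 3-second sound only needs 3 seconds' worth of latent frames,
# rather than the frames for a multi-minute, full-length generation.
n = latent_frames(3.0)

# Targeted edit: regenerate only the span from 1 s to 2 s.
mask = inpaint_mask(n, start_s=1.0, end_s=2.0)

# Continuation: keep a 1-second recording and generate the remaining 2 seconds.
continuation = inpaint_mask(n, start_s=1.0, end_s=3.0)
```

Under this convention, editing and continuation are the same operation with different masks, which is one plausible reading of why a single inpainting mechanism can cover both use cases named in the abstract.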