Stable Audio 3

Abstract

Stable Audio 3 is a family of fast latent diffusion models (small, medium, and large) for variable-length audio generation and editing. Because our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing full-length outputs for short sounds. We also support inpainting, which enables targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent space. Finally, we apply adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data and generate music and sound effects in less than 2 s on an H200 GPU and in a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipelines.