Stable Audio 3

May 18, 2026
Authors: Zach Evans, Julian D. Parker, Matthew Rice, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
cs.AI

Abstract

Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Because our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of a full-length generation when only a short sound is needed. We also support inpainting, which enables targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent space. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data and generate music and sounds in less than 2 seconds on an H200 GPU and within a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipeline.
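
The variable-length and inpainting ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sample rate, the latent downsampling factor, and the function names `latent_frames` and `inpaint_mask` are all assumptions chosen for illustration, since the abstract does not state them. The sketch shows only the bookkeeping: a shorter request needs fewer latent frames to diffuse over, and an inpainting or continuation request can be described by a per-frame mask marking which frames are kept and which are regenerated.

```python
import math

# Hypothetical values for illustration only; the abstract does not state them.
SAMPLE_RATE = 44_100   # audio samples per second (assumed)
DOWNSAMPLE = 2_048     # audio samples per latent frame (assumed)


def latent_frames(seconds: float) -> int:
    """Number of latent frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * SAMPLE_RATE / DOWNSAMPLE)


def inpaint_mask(total_seconds: float, start: float, end: float) -> list[int]:
    """Per-frame mask: 1 = keep the original audio, 0 = regenerate."""
    if not 0 <= start < end <= total_seconds:
        raise ValueError("need 0 <= start < end <= total_seconds")
    n = latent_frames(total_seconds)
    # Round outward so the regenerated region fully covers [start, end].
    first = math.floor(start * SAMPLE_RATE / DOWNSAMPLE)
    last = min(n, math.ceil(end * SAMPLE_RATE / DOWNSAMPLE))
    return [0 if first <= i < last else 1 for i in range(n)]


short = latent_frames(5)                 # a short sound effect
full = latent_frames(180)                # a three-minute track
mask = inpaint_mask(10, 4, 6)            # edit seconds 4 to 6 of a 10 s clip
continuation = inpaint_mask(10, 3, 10)   # keep the first 3 s, generate the rest
```

Under these assumed values, the five-second request covers far fewer latent frames than the three-minute one, which is the saving that variable-length generation targets. Continuation is the special case of inpainting in which the regenerated region runs to the end of the clip.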