Stable Audio 3

Abstract

Stable Audio 3 is a family of fast latent diffusion models (small, medium, and large) for variable-length audio generation and editing. Because our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing a full-length output when only a short sound is needed. We also support inpainting, which enables targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent space. Finally, we apply adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data, and generate music and sounds in less than 2 s on an H200 GPU and within a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipelines.