Stable Audio Open

Abstract

Open generative models are vitally important to the community: they can be fine-tuned and serve as baselines when new models are presented. However, most current text-to-audio models are private, so artists and researchers cannot build on them. Here we describe the architecture and training process of a new open-weights text-to-audio model trained on Creative Commons data. Our evaluation shows that the model's performance is competitive with the state of the art across various metrics. Notably, the reported FDopenl3 results (which measure the realism of the generated audio) showcase its potential for high-quality stereo sound synthesis at 44.1 kHz.