Woosh: A Sound Effects Foundation Model

Abstract

The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and evaluation against other popular open models. Optimized for sound effects, the release provides (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module compared with existing open alternatives such as StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
