Woosh: A Sound Effects Foundation Model

Abstract

The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, and detail its architecture, its training process, and an evaluation against other popular open models. The release is optimized for sound effects and provides (1) a high-quality audio encoder/decoder model, (2) a text-audio alignment model for conditioning, (3) a text-to-audio generative model, and (4) a video-to-audio generative model. Distilled text-to-audio and video-to-audio models are also included, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module compared with existing open alternatives such as StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
