Woosh: A Sound Effects Foundation Model
April 2, 2026
Authors: Gaëtan Hadjeres, Marc Ferras, Khaled Koutini, Benno Weck, Alexandre Bittar, Thomas Hummel, Zineb Lahrici, Hakim Missoum, Joan Serrà, Yuki Mitsufuji
cs.AI
Abstract
The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effects foundation model, detailing its architecture, training process, and an evaluation against other popular open models. As the model is optimized for sound effects, we provide (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, enabling low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module compared to existing open alternatives such as StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh, and demo samples can be found at https://sonyresearch.github.io/Woosh/.