Woosh: A Sound Effects Foundation Model

Abstract

The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and evaluation against other popular open models. Optimized for sound effects, the release provides (1) a high-quality audio encoder/decoder model, (2) a text-audio alignment model for conditioning, (3) a text-to-audio generative model, and (4) a video-to-audio generative model. Distilled text-to-audio and video-to-audio models are also included, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives such as StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
