SAM Audio: Segment Anything in Audio
December 19, 2025
Authors: Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, Christoph Feichtenhofer, Piotr Dollár, Wei-Ning Hsu, Ann Lee
cs.AI
Abstract
General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audio, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.
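The abstract names the training recipe (a diffusion transformer trained with flow matching, conditioned on multimodal prompts) without spelling it out. Below is a minimal PyTorch sketch of what a conditional flow-matching training step in that style looks like; the TinyVelocityNet module, the latent shapes, and the concatenation-based prompt conditioning are all illustrative assumptions, not SAM Audio's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVelocityNet(nn.Module):
    """Stand-in for the diffusion transformer: predicts the flow velocity
    from the noisy target latent, the mixture latent, a prompt embedding,
    and the time step t. (Hypothetical module, not SAM Audio's.)"""
    def __init__(self, dim=64, prompt_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2 + prompt_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t, mixture, prompt, t):
        # Concatenate the noisy target, the conditions, and time into one input.
        return self.net(torch.cat([x_t, mixture, prompt, t], dim=-1))

def flow_matching_loss(model, target, mixture, prompt):
    """Conditional flow matching: regress the constant velocity
    (target - noise) along the straight path x_t = (1-t)*noise + t*target."""
    noise = torch.randn_like(target)
    t = torch.rand(target.shape[0], 1)      # one time per batch element
    x_t = (1 - t) * noise + t * target      # linear interpolant at time t
    v_target = target - noise               # velocity of the straight path
    v_pred = model(x_t, mixture, prompt, t)
    return F.mse_loss(v_pred, v_target)

# Toy usage: a batch of 8 "audio latents" (dim 64) with 32-dim prompt
# embeddings standing in for encoded text / visual-mask / span prompts.
model = TinyVelocityNet()
target = torch.randn(8, 64)    # latent of the isolated target source
mixture = torch.randn(8, 64)   # latent of the full mixture
prompt = torch.randn(8, 32)    # embedding of the multimodal prompt
loss = flow_matching_loss(model, target, mixture, prompt)
loss.backward()
```

At inference time, the standard flow-matching recipe would integrate the learned velocity field from Gaussian noise at t = 0 to the separated source at t = 1 (e.g., with a few Euler steps), conditioned on the mixture and the chosen prompt.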