
SAM Audio: Segment Anything in Audio

December 19, 2025
Authors: Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, Christoph Feichtenhofer, Piotr Dollár, Wei-Ning Hsu, Ann Lee
cs.AI

Abstract

General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audio, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.
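The abstract does not give implementation details, but the training objective it names, flow matching, has a standard form. Below is a minimal sketch of one conditional flow-matching training step, assuming a toy velocity network in place of the diffusion transformer; `VelocityNet`, `flow_matching_step`, and the latent/prompt tensor shapes are illustrative placeholders, not SAM Audio's actual interfaces.

```python
# Minimal sketch of conditional flow matching, assuming a toy velocity
# network and hypothetical latent/prompt tensors (not SAM Audio's API).
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Stand-in for the diffusion transformer: predicts the flow velocity."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 3 + 1, 512), nn.SiLU(), nn.Linear(512, dim)
        )

    def forward(self, x_t, t, mixture, prompt):
        # Condition on the noisy target, timestep, mixture, and prompt.
        h = torch.cat([x_t, mixture, prompt, t], dim=-1)
        return self.net(h)

def flow_matching_step(model, target, mixture, prompt):
    """One training step: regress the straight-line velocity x1 - x0."""
    x1 = target                               # clean source latents
    x0 = torch.randn_like(x1)                 # Gaussian noise sample
    t = torch.rand(x1.size(0), 1)             # uniform timestep in [0, 1]
    x_t = (1 - t) * x0 + t * x1               # linear interpolation path
    v_target = x1 - x0                        # constant velocity along path
    v_pred = model(x_t, t, mixture, prompt)
    return ((v_pred - v_target) ** 2).mean()  # flow-matching MSE loss

model = VelocityNet()
loss = flow_matching_step(
    model,
    target=torch.randn(8, 128),   # latents of the isolated source
    mixture=torch.randn(8, 128),  # latents of the input mixture
    prompt=torch.randn(8, 128),   # text/visual/span prompt embedding
)
loss.backward()
```

In this formulation the model learns the constant velocity along a straight path from noise to the clean source latents; at inference, integrating the learned velocity field from t=0 to t=1, conditioned on the mixture and the chosen prompt, yields the separated source.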