

A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation

January 30, 2026
作者: Kai Li, Jintao Cheng, Chang Zeng, Zijun Yan, Helin Wang, Zixiong Su, Bo Zheng, Xiaolin Hu
cs.AI

Abstract

Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from audio mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: in-the-wild datasets contain weak labels and severe event co-occurrence. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates event co-occurrence by mining high-purity single-event segments from in-the-wild datasets via a semantically consistent synthesis protocol. Using this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2.4k hours of raw audio. Experimental results demonstrate that, compared with the state-of-the-art model SAM-Audio, which was trained on a dataset roughly 500 times larger than Hive, certain open-source models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibit remarkable zero-shot generalization on out-of-distribution evaluation benchmarks. These findings highlight that prioritizing the purity of supervision signals enables significant data efficiency, offering a new paradigm for training robust auditory foundation models at reduced computational cost. Code and dataset are available at https://shandaai.github.io/Hive.
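The abstract does not give implementation details, but the core idea, mining single-event segments and remixing them so that target events never co-occur in the supervision signal, can be sketched as follows. All function names, the purity threshold, and the use of a pretrained event tagger are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of a Hive-style synthesis protocol (hypothetical;
# thresholds, tagger, and sampling strategy are assumptions).
import random
import numpy as np

PURITY_THRESHOLD = 0.9  # assumed: probability mass required on one event class


def is_single_event(tag_probs: dict[str, float]) -> bool:
    """Keep a segment only if one event class dominates the tagger output,
    i.e. there is no significant co-occurring event."""
    top = max(tag_probs.values())
    rest = sum(tag_probs.values()) - top
    return top >= PURITY_THRESHOLD and rest <= (1.0 - PURITY_THRESHOLD)


def mix_at_snr(target: np.ndarray, interferer: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the interferer so the mixture has the requested target SNR."""
    p_t = np.mean(target ** 2) + 1e-12
    p_i = np.mean(interferer ** 2) + 1e-12
    gain = np.sqrt(p_t / (p_i * 10 ** (snr_db / 10)))
    return target + gain * interferer


def synthesize_example(pool: dict[str, list[np.ndarray]]):
    """Draw target and interferer from *different* classes, so the ground
    truth for each text query contains exactly one event type."""
    target_cls, interferer_cls = random.sample(sorted(pool), 2)
    target = random.choice(pool[target_cls])
    interferer = random.choice(pool[interferer_cls])
    mixture = mix_at_snr(target, interferer, snr_db=random.uniform(-5, 5))
    return mixture, target, target_cls  # (model input, ground truth, query)
```

Because every mixture is assembled from purity-filtered, non-co-occurring segments, a separator trained on such data cannot exploit spurious background cues to identify the target class, which is the mechanism the abstract credits for Hive's data efficiency.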