A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation

January 30, 2026
作者: Kai Li, Jintao Cheng, Chang Zeng, Zijun Yan, Helin Wang, Zixiong Su, Bo Zheng, Xiaolin Hu
cs.AI

Abstract

Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: in-the-wild datasets contain weak labels and severe co-occurrence of events. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates event co-occurrence by mining high-purity single-event segments from in-the-wild datasets via a semantically consistent synthesis protocol. Using this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2.4k hours of raw audio. Experimental results demonstrate that, compared with the state-of-the-art model SAM-Audio, which was trained on a dataset roughly 500 times larger than Hive, certain open-source models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibit remarkable zero-shot generalization on out-of-distribution evaluation benchmarks. These findings highlight that prioritizing the purity of supervision signals enables significant data efficiency, offering a new paradigm for training robust auditory foundation models at reduced computational cost. Code and dataset are available at https://shandaai.github.io/Hive.
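
The key idea in the abstract, drawing the target and the interferer from distinct event classes so that synthesized mixtures never reproduce the co-occurrence statistics of in-the-wild recordings, can be illustrated with a short sketch. The code below is a minimal illustration under assumed conventions (the function names, 16 kHz mono float32 waveforms, and the uniform SNR range are all hypothetical); the paper's actual synthesis protocol may differ in its details.

```python
# Minimal sketch of semantically consistent mixture synthesis, assuming
# high-purity single-event segments have already been mined and grouped
# by class label. All names here are illustrative, not the paper's API.
import random
import numpy as np

def _fit(x: np.ndarray, n: int) -> np.ndarray:
    """Randomly crop or zero-pad a 1-D waveform to exactly n samples."""
    if len(x) >= n:
        start = random.randint(0, len(x) - n)
        return x[start:start + n]
    out = np.zeros(n, dtype=x.dtype)
    offset = random.randint(0, n - len(x))
    out[offset:offset + len(x)] = x
    return out

def synthesize_mixture(segments, sr=16000, duration=5.0, snr_db=(-5.0, 5.0)):
    """Build one (mixture, target, query_label) training triplet.

    segments: dict mapping class label -> list of single-event float32
    waveforms. Sampling target and interferer from *different* labels is
    what removes event co-occurrence from the supervision signal.
    """
    n = int(sr * duration)
    target_label, interf_label = random.sample(list(segments), 2)
    target = _fit(random.choice(segments[target_label]), n)
    interf = _fit(random.choice(segments[interf_label]), n)

    # Scale the interferer to a random SNR relative to the target:
    # SNR_dB = 10 * log10(P_target / P_interferer_scaled).
    snr = random.uniform(*snr_db)
    t_pow = np.mean(target ** 2) + 1e-10
    i_pow = np.mean(interf ** 2) + 1e-10
    gain = np.sqrt(t_pow / (i_pow * 10.0 ** (snr / 10.0)))
    mixture = target + gain * interf
    return mixture, target, target_label
```

Because each mixture pairs the clean single-event target with the mixed input, the separation model is supervised only by the target's own acoustics, not by whatever background happened to co-occur with that class in the source recordings.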