Universal Source Separation with Weakly Labelled Data
May 11, 2023
Authors: Qiuqiang Kong, Ke Chen, Haohe Liu, Xingjian Du, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Mark D. Plumbley
cs.AI
Abstract
Universal source separation (USS) is a fundamental research task for
computational auditory scene analysis, which aims to separate mono recordings
into individual source tracks. Three potential challenges remain to be
addressed in the audio source separation task. First, previous audio source
separation systems mainly focus on separating one or a limited number of
specific sources. There is a lack of research on building a unified system that
can separate arbitrary sources via a single model. Second, most previous
systems require clean source data to train a separator, while clean source data
are scarce. Third, there is a lack of USS systems that can automatically detect
and separate active sound classes at a hierarchical level. To use large-scale
weakly labeled/unlabeled audio data for audio source separation, we propose a
universal audio source separation framework containing: 1) an audio tagging
model trained on weakly labeled data as a query net; and 2) a conditional
source separation model that takes query net outputs as conditions to separate
arbitrary sound sources. We investigate various query nets, source separation
models, and training strategies, and propose a hierarchical USS strategy to
automatically detect and separate sound classes from the AudioSet ontology. By
solely leveraging the weakly labelled AudioSet, our USS system successfully
separates a wide variety of sound classes, covering tasks such as sound event
separation, music source separation, and speech enhancement. The USS system achieves an
average signal-to-distortion ratio improvement (SDRi) of 5.57 dB over 527 sound
classes of AudioSet; 10.57 dB on the DCASE 2018 Task 2 dataset; 8.12 dB on the
MUSDB18 dataset; an SDRi of 7.28 dB on the Slakh2100 dataset; and an SSNR of
9.00 dB on the voicebank-demand dataset. We release the source code at
https://github.com/bytedance/uss.
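
The core idea of the framework, an audio tagging model whose output embedding conditions a separator, can be illustrated with a minimal sketch. The code below assumes PyTorch; the names (QueryNet, ConditionalSeparator) and the small FiLM-style conditioning network are illustrative assumptions, not the bytedance/uss implementation, which uses an AudioSet-trained tagging model as the query net and a much larger separation backbone.

```python
# Minimal sketch of query-conditioned separation (hypothetical, PyTorch-based).
import torch
import torch.nn as nn


class QueryNet(nn.Module):
    """Stand-in for an audio tagging model whose embedding describes the target source."""

    def __init__(self, n_mels=64, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, query_spec):            # (batch, 1, frames, n_mels)
        return self.encoder(query_spec)       # (batch, embed_dim)


class ConditionalSeparator(nn.Module):
    """Predicts a spectrogram mask for the source described by the condition vector."""

    def __init__(self, embed_dim=128, channels=32):
        super().__init__()
        self.conv_in = nn.Conv2d(1, channels, 3, padding=1)
        # FiLM-style conditioning: the query embedding scales and shifts the features.
        self.film = nn.Linear(embed_dim, 2 * channels)
        self.conv_out = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, mixture_spec, condition):
        h = torch.relu(self.conv_in(mixture_spec))          # (B, C, T, F)
        gamma, beta = self.film(condition).chunk(2, dim=-1)
        h = gamma[:, :, None, None] * h + beta[:, :, None, None]
        mask = torch.sigmoid(self.conv_out(h))              # (B, 1, T, F)
        return mask * mixture_spec                          # separated spectrogram


# Usage: separate whatever source the query clip contains from the mixture.
query_net, separator = QueryNet(), ConditionalSeparator()
mixture = torch.rand(2, 1, 100, 64)    # (batch, 1, frames, mel bins)
query = torch.rand(2, 1, 100, 64)
condition = query_net(query)
separated = separator(mixture, condition)
print(separated.shape)                 # torch.Size([2, 1, 100, 64])
```

Because the condition comes from a tagging model trained on weakly labelled clips, the same separator can be steered toward any AudioSet class without clean source data for that class.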
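
The SDRi figures quoted above are, under the common definition, the SDR of the separated signal minus the SDR of the unprocessed mixture, both measured against the target source. The toy computation below illustrates that definition with NumPy; it is a generic reference computation, not the paper's evaluation code.

```python
# Illustration of SDRi = SDR(separated, target) - SDR(mixture, target).
import numpy as np


def sdr(estimate, target, eps=1e-8):
    """Signal-to-distortion ratio in dB (simple, non-permutation-invariant form)."""
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))


rng = np.random.default_rng(0)
target = rng.standard_normal(16000)           # 1 s of target source at 16 kHz
interference = rng.standard_normal(16000)
mixture = target + interference
separated = target + 0.2 * interference       # a hypothetical separator output

sdri = sdr(separated, target) - sdr(mixture, target)
print(f"SDRi: {sdri:.2f} dB")                 # roughly 14 dB in this toy case
```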