Universal Source Separation with Weakly Labelled Data
May 11, 2023
Authors: Qiuqiang Kong, Ke Chen, Haohe Liu, Xingjian Du, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Mark D. Plumbley
cs.AI
Abstract
Universal source separation (USS) is a fundamental research task for
computational auditory scene analysis, which aims to separate mono recordings
into individual source tracks. Three challenges remain to be addressed in the audio source separation task. First, previous audio source
separation systems mainly focus on separating one or a limited number of
specific sources. There is a lack of research on building a unified system that
can separate arbitrary sources via a single model. Second, most previous
systems require clean source data to train a separator, while clean source data
are scarce. Third, there is a lack of USS systems that can automatically detect and separate active sound classes at a hierarchical level. To use large-scale weakly labelled/unlabelled audio data for audio source separation, we propose a
universal audio source separation framework containing: 1) an audio tagging
model trained on weakly labelled data as a query net; and 2) a conditional
source separation model that takes query net outputs as conditions to separate
arbitrary sound sources. We investigate various query nets, source separation
models, and training strategies and propose a hierarchical USS strategy to
automatically detect and separate sound classes from the AudioSet ontology. By
solely leveraging the weakly labelled AudioSet, our USS system is successful in
separating a wide variety of sound classes, including sound event separation,
music source separation, and speech enhancement. The USS system achieves an
average signal-to-distortion ratio improvement (SDRi) of 5.57 dB over 527 sound
classes of AudioSet; 10.57 dB on the DCASE 2018 Task 2 dataset; 8.12 dB on the
MUSDB18 dataset; 7.28 dB on the Slakh2100 dataset; and an SSNR of 9.00 dB on the voicebank-demand dataset. We release the source code at https://github.com/bytedance/uss.
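
Below is a minimal, self-contained PyTorch sketch of the query-conditioned separation idea summarised above: an audio tagging query net maps query audio to a condition embedding, a conditional separator uses that embedding to extract the queried source from a mixture, and a small helper computes the SDRi metric quoted in the results. All class names (`QueryNet`, `ConditionalSeparator`, `sdri`), shapes, and the FiLM-style conditioning are illustrative assumptions for exposition; they do not reproduce the actual query nets, separator architecture, or API of the bytedance/uss repository.

```python
# Illustrative sketch only: toy stand-ins for a query-conditioned USS pipeline.
import torch
import torch.nn as nn


class QueryNet(nn.Module):
    """Toy audio tagging model: maps a mono waveform to class probabilities
    and a condition embedding (a stand-in for the paper's query net)."""

    def __init__(self, num_classes: int = 527, embed_dim: int = 128):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=1024, stride=512),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.embed = nn.Linear(32, embed_dim)
        self.tagger = nn.Linear(embed_dim, num_classes)

    def forward(self, waveform: torch.Tensor):
        # waveform: (batch, samples)
        x = self.frontend(waveform.unsqueeze(1)).squeeze(-1)   # (batch, 32)
        embedding = self.embed(x)                               # (batch, embed_dim)
        tags = torch.sigmoid(self.tagger(embedding))            # (batch, num_classes)
        return tags, embedding


class ConditionalSeparator(nn.Module):
    """Toy conditional separator: the query embedding gates the mixture
    features (FiLM-style), standing in for the conditional separation model."""

    def __init__(self, embed_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.film = nn.Linear(embed_dim, hidden)
        self.encoder = nn.Conv1d(1, hidden, kernel_size=7, padding=3)
        self.decoder = nn.Conv1d(hidden, 1, kernel_size=7, padding=3)

    def forward(self, mixture: torch.Tensor, condition: torch.Tensor):
        # mixture: (batch, samples), condition: (batch, embed_dim)
        h = torch.relu(self.encoder(mixture.unsqueeze(1)))      # (batch, hidden, samples)
        gate = torch.sigmoid(self.film(condition)).unsqueeze(-1)  # (batch, hidden, 1)
        h = h * gate                                            # condition the features
        return self.decoder(h).squeeze(1)                       # (batch, samples)


def sdri(est: torch.Tensor, ref: torch.Tensor, mix: torch.Tensor) -> torch.Tensor:
    """Signal-to-distortion ratio improvement: SDR(est, ref) - SDR(mix, ref)."""
    def sdr(x, s):
        return 10 * torch.log10(
            s.pow(2).sum(-1) / (s - x).pow(2).sum(-1).clamp(min=1e-8)
        )
    return sdr(est, ref) - sdr(mix, ref)


if __name__ == "__main__":
    mixture = torch.randn(2, 32000)           # two 2-second mono mixtures at 16 kHz
    query_audio = torch.randn(2, 32000)       # audio containing the target sound class
    query_net = QueryNet()
    separator = ConditionalSeparator()
    _, condition = query_net(query_audio)     # condition embedding from the query net
    estimate = separator(mixture, condition)  # separated waveform for the queried class
    print(estimate.shape)                     # torch.Size([2, 32000])
```

In the hierarchical setting described in the abstract, one would run the tagging head over the AudioSet ontology to decide which sound classes are active in the mixture, then issue one conditioned separation pass per detected class; the sketch above shows only a single query-to-separation step.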