Universal Source Separation with Weakly Labelled Data

Abstract

Universal source separation (USS) is a fundamental research task in computational auditory scene analysis that aims to separate mono recordings into individual source tracks. Three challenges remain open in audio source separation. First, previous audio source separation systems mainly focus on separating one or a limited number of specific sources; there is little research on building a unified system that can separate arbitrary sources with a single model. Second, most previous systems require clean source data to train a separator, and clean source data are scarce. Third, USS systems that can automatically detect and separate active sound classes at a hierarchical level are lacking. To use large-scale weakly labelled and unlabelled audio data for audio source separation, we propose a universal audio source separation framework containing: 1) an audio tagging model trained on weakly labelled data as a query net; and 2) a conditional source separation model that takes query net outputs as conditions to separate arbitrary sound sources. We investigate various query nets, source separation models, and training strategies, and propose a hierarchical USS strategy to automatically detect and separate sound classes from the AudioSet ontology. Using only the weakly labelled AudioSet, our USS system successfully separates a wide variety of sound classes across tasks including sound event separation, music source separation, and speech enhancement. The USS system achieves an average signal-to-distortion ratio improvement (SDRi) of 5.57 dB over 527 sound classes of AudioSet; 10.57 dB on the DCASE 2018 Task 2 dataset; 8.12 dB on the MUSDB18 dataset; an SDRi of 7.28 dB on the Slakh2100 dataset; and an SSNR of 9.00 dB on the voicebank-demand dataset. We release the source code at https://github.com/bytedance/uss
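The headline metric, SDRi, measures how much closer the separated output is to the target source than the unprocessed mixture was. The sketch below is a minimal illustration assuming the common energy-ratio definition of SDR, 10·log10(‖s‖² / ‖s − ŝ‖²); it is not taken from the released code, the evaluation used for the reported numbers may differ in detail, and the function names and toy signals are hypothetical.

```python
import math


def sdr(reference, estimate, eps=1e-10):
    """Signal-to-distortion ratio in dB (assumed energy-ratio definition)."""
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    # eps guards against division by zero and log of zero.
    return 10.0 * math.log10((signal_power + eps) / (error_power + eps))


def sdr_improvement(reference, estimate, mixture):
    """SDRi: SDR of the separated estimate minus SDR of the raw mixture."""
    return sdr(reference, estimate) - sdr(reference, mixture)


# Toy signals (hypothetical): a target tone plus an interfering tone.
num_samples = 1000
reference = [math.sin(0.1 * n) for n in range(num_samples)]
interference = [0.5 * math.cos(0.37 * n) for n in range(num_samples)]
mixture = [s + i for s, i in zip(reference, interference)]

# Stand-in for a separator that leaves 10% of the interference amplitude.
estimate = [s + 0.1 * i for s, i in zip(reference, interference)]

sdri = sdr_improvement(reference, estimate, mixture)
print(f"SDRi: {sdri:.2f} dB")  # about 20 dB: residual amplitude is 10x smaller
```

Because the stand-in separator cuts the interference amplitude by a factor of ten, the error energy falls by a factor of one hundred, which corresponds to an improvement of roughly 20 dB regardless of the original mixture SDR.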
