ZeroSep: Separate Anything in Audio with Zero Training

Abstract

Audio source separation is fundamental for machines to understand complex acoustic environments, and it underpins numerous audio applications. Current supervised deep learning approaches, while powerful, require extensive task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: under the right configuration, zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model. Our method, named ZeroSep, inverts the mixed audio into the diffusion model's latent space and then uses text conditioning to guide the denoising process toward recovering individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.
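
The inversion-then-denoising idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a deterministic DDIM-style update, which the abstract does not specify, and it replaces the pre-trained text-guided backbone with a hypothetical linear noise predictor (`toy_eps_model`) so that the control flow runs end to end. The names `invert`, `denoise`, `mixture_prompt`, and `target_prompt`, the noise schedule, and the choice of prompt used during inversion are all illustrative assumptions.

```python
import numpy as np


def make_alpha_bars(num_steps=50, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def toy_eps_model(z, t, prompt_vec):
    """Hypothetical stand-in for a pre-trained text-conditioned noise predictor.

    A real backbone is a neural network conditioned on a text embedding.
    This linear toy ignores the step index t and has no separation ability;
    it only makes the pipeline below runnable.
    """
    return 0.1 * (z - prompt_vec)


def ddim_step(z, eps, a_from, a_to):
    """Deterministic DDIM update from noise level a_from to noise level a_to."""
    z0_pred = (z - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * z0_pred + np.sqrt(1.0 - a_to) * eps


def invert(z0, alpha_bars, eps_model, prompt_vec):
    """Map a clean latent to a noisy latent by stepping toward more noise."""
    z, a_prev = z0, 1.0
    for t, a in enumerate(alpha_bars):
        eps = eps_model(z, t, prompt_vec)
        z = ddim_step(z, eps, a_prev, a)
        a_prev = a
    return z


def denoise(z_noisy, alpha_bars, eps_model, prompt_vec):
    """Step back toward a clean latent, guided by the given prompt vector."""
    z = z_noisy
    for t in range(len(alpha_bars) - 1, -1, -1):
        a = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = eps_model(z, t, prompt_vec)
        z = ddim_step(z, eps, a, a_prev)
    return z


rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
mixture_latent = rng.standard_normal(8)   # stands in for the encoded mixture
mixture_prompt = rng.standard_normal(8)   # stands in for a text embedding
target_prompt = rng.standard_normal(8)    # embedding of the source to extract

# Invert the mixture, then denoise twice: once with the inversion prompt
# (approximate reconstruction) and once with the target-source prompt.
noisy_latent = invert(mixture_latent, alpha_bars, toy_eps_model, mixture_prompt)
reconstruction = denoise(noisy_latent, alpha_bars, toy_eps_model, mixture_prompt)
separated = denoise(noisy_latent, alpha_bars, toy_eps_model, target_prompt)
```

In this toy setting, denoising with the same prompt used for inversion approximately recovers the original latent, while swapping in a different prompt steers the result elsewhere. With a real pre-trained backbone, that steering is what would be expected to isolate the source named in the text prompt.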