
ZeroSep: Separate Anything in Audio with Zero Training

May 29, 2025
Authors: Chao Huang, Yuesheng Ma, Junxuan Huang, Susan Liang, Yunlong Tang, Jing Bi, Wenqiang Liu, Nima Mesgarani, Chenliang Xu
cs.AI

Abstract

Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.
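
As a rough illustration of the recipe the abstract describes, here is a minimal, self-contained PyTorch sketch: a mixture latent is pushed into the model's noise space via deterministic DDIM inversion, then denoised once per text prompt so the conditioning steers each pass toward one source. The `ToyEpsModel` and the random prompt embeddings are hypothetical stand-ins for a real pre-trained text-guided audio diffusion backbone and its text encoder; only the DDIM arithmetic is standard. This is an assumed reading of the abstract, not the authors' released implementation.

```python
import torch

class ToyEpsModel(torch.nn.Module):
    """Stand-in for a pre-trained text-conditioned noise predictor (hypothetical)."""
    def __init__(self, dim=256, cond_dim=16):
        super().__init__()
        self.net = torch.nn.Linear(dim + cond_dim + 1, dim)

    def forward(self, x, t, cond):
        t_feat = torch.full((x.shape[0], 1), float(t))
        return self.net(torch.cat([x, cond, t_feat], dim=-1))

def alphas_cumprod(T=50):
    """Standard DDPM-style noise schedule (cumulative alpha products)."""
    betas = torch.linspace(1e-4, 2e-2, T)
    return torch.cumprod(1.0 - betas, dim=0)

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM move between two noise levels (eta = 0)."""
    x0 = (x - torch.sqrt(1 - a_from) * eps) / torch.sqrt(a_from)
    return torch.sqrt(a_to) * x0 + torch.sqrt(1 - a_to) * eps

@torch.no_grad()
def ddim_invert(model, x0, cond, a_bar):
    """Run the DDIM updates in reverse, pushing a clean latent toward noise."""
    x = x0
    for t in range(len(a_bar) - 1):
        eps = model(x, t, cond)
        x = ddim_step(x, eps, a_bar[t], a_bar[t + 1])
    return x

@torch.no_grad()
def ddim_denoise(model, x_T, cond, a_bar):
    """Standard DDIM sampling from the inverted latent, steered by `cond`."""
    x = x_T
    for t in range(len(a_bar) - 1, 0, -1):
        eps = model(x, t, cond)
        x = ddim_step(x, eps, a_bar[t], a_bar[t - 1])
    return x

# Usage: one estimated source per text prompt.
model = ToyEpsModel()
a_bar = alphas_cumprod()
mixture = torch.randn(1, 256)            # stands in for an encoded audio mixture
neutral = torch.zeros(1, 16)             # unconditional embedding used for inversion
prompts = [torch.randn(1, 16), torch.randn(1, 16)]  # e.g. "dog barking", "siren"

x_T = ddim_invert(model, mixture, neutral, a_bar)
sources = [ddim_denoise(model, x_T, p, a_bar) for p in prompts]
```

Running the inversion under a neutral (unconditional) embedding and reserving the text prompt for the denoising pass is one plausible configuration; the "right configuration" the abstract refers to presumably involves similar choices about conditioning and guidance.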
