

ZeroSep: Separate Anything in Audio with Zero Training

May 29, 2025
作者: Chao Huang, Yuesheng Ma, Junxuan Huang, Susan Liang, Yunlong Tang, Jing Bi, Wenqiang Liu, Nima Mesgarani, Chenliang Xu
cs.AI

Abstract

Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.
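The abstract describes a two-stage procedure: deterministically invert the mixture into the diffusion model's noisy latent space, then re-run the denoising process conditioned on a text prompt describing the target source. The following is a minimal sketch of that loop structure using standard DDIM updates; the step count, noise schedule, and the `eps_model` placeholder are illustrative assumptions, not the authors' implementation, which builds on a pre-trained text-guided audio diffusion model.

```python
# Hedged sketch of the ZeroSep idea from the abstract: DDIM inversion of a
# mixture latent, then text-conditioned denoising to recover one source.
# All names, shapes, and the schedule below are hypothetical stand-ins.
import torch

T = 50                                    # number of DDIM steps (assumption)
alphas = torch.linspace(0.999, 0.01, T)   # toy cumulative alpha-bar schedule

def eps_model(x_t, t, cond):
    """Placeholder noise predictor. A real system would call a pre-trained
    text-guided audio diffusion UNet with a text-conditioning embedding."""
    return torch.zeros_like(x_t)

def ddim_step(x, t_from, t_to, cond, model=eps_model):
    """One deterministic (eta = 0) DDIM transition between noise levels."""
    a_from, a_to = alphas[t_from], alphas[t_to]
    eps = model(x, t_from, cond)
    x0 = (x - (1 - a_from).sqrt() * eps) / a_from.sqrt()  # predicted clean latent
    return a_to.sqrt() * x0 + (1 - a_to).sqrt() * eps

def separate(mix_latent, source_cond, mix_cond=None):
    # 1) Inversion: run the deterministic steps in reverse order (adding
    #    noise) to map the mixture latent to a noisy latent near x_T.
    x = mix_latent
    for t in range(T - 1):
        x = ddim_step(x, t, t + 1, mix_cond)
    # 2) Text-guided denoising: run forward again, now conditioned on the
    #    target source's text prompt, so the model reconstructs that source.
    for t in reversed(range(1, T)):
        x = ddim_step(x, t, t - 1, source_cond)
    return x

if __name__ == "__main__":
    mixture = torch.randn(1, 8, 256)   # stand-in for an encoded mixture latent
    drums = separate(mixture, source_cond="a drum kit playing")
    print(drums.shape)
```

Because both stages reuse the same frozen generative model and differ only in the conditioning signal, no task-specific training is needed, and any source describable in text can in principle be targeted.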
