ZeroSep: Separate Anything in Audio with Zero Training

Abstract

Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.
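
The invert-then-guide flow described in the abstract can be illustrated with a toy sketch. Nothing below is taken from the ZeroSep implementation. It assumes a deterministic DDIM-style update, an empty prompt during inversion, and a hypothetical stand-in noise predictor (`toy_eps_model`) in which each prompt simply selects a fixed set of latent dimensions. A real system would use a pre-trained text-guided audio diffusion backbone in its place, and all names here (`PROMPT_MASKS`, `invert`, `denoise`, `separate`) are illustrative only.

```python
import numpy as np

# Hypothetical toy "text prior": each prompt selects latent dimensions.
# A real text-guided diffusion model learns this association from data.
PROMPT_MASKS = {
    "": np.ones(4),                                  # empty prompt keeps everything
    "dog barking": np.array([1.0, 1.0, 0.0, 0.0]),
    "piano": np.array([0.0, 0.0, 1.0, 1.0]),
}

# Cumulative signal levels (alpha-bar) for a 50-step linear beta schedule.
ALPHA_BARS = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 50))


def toy_eps_model(x, t, prompt):
    """Stand-in noise predictor: treats masked-out dimensions as noise."""
    a_t = ALPHA_BARS[t]
    x0_hat = PROMPT_MASKS[prompt] * x / np.sqrt(a_t)  # prompt-conditioned clean guess
    return (x - np.sqrt(a_t) * x0_hat) / np.sqrt(1.0 - a_t)


def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update from signal level a_from to a_to."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps


def invert(latent, prompt=""):
    """Walk a clean latent toward noise (assumed: inversion with empty prompt)."""
    x, a_prev = latent, 1.0
    for t, a in enumerate(ALPHA_BARS):
        x = ddim_step(x, toy_eps_model(x, t, prompt), a_prev, a)
        a_prev = a
    return x


def denoise(noisy, prompt):
    """Walk back to a clean latent, guided by the text prompt."""
    x = noisy
    for t in range(len(ALPHA_BARS) - 1, -1, -1):
        a_to = ALPHA_BARS[t - 1] if t > 0 else 1.0
        x = ddim_step(x, toy_eps_model(x, t, prompt), ALPHA_BARS[t], a_to)
    return x


def separate(mixture_latent, prompt):
    """Invert the mixture, then denoise under the target-source prompt."""
    return denoise(invert(mixture_latent), prompt)


if __name__ == "__main__":
    mixture = np.array([1.0, 2.0, 3.0, -1.0])  # two toy sources on disjoint dims
    print(separate(mixture, "dog barking"))
    print(separate(mixture, "piano"))
```

In this toy setting the two sources occupy disjoint latent dimensions, so the prompt-guided denoising recovers each one almost exactly. That property comes from the hand-built predictor, not from anything the paper claims; it serves only to show how inversion and text-conditioned denoising fit together without any separation-specific training.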