SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer
September 12, 2024
Authors: Helin Wang, Jiarui Hai, Yen-Ju Lu, Karan Thakkar, Mounya Elhilali, Najim Dehak
cs.AI
Abstract
In this paper, we introduce SoloAudio, a novel diffusion-based generative
model for target sound extraction (TSE). Our approach trains latent diffusion
models on audio, replacing the previous U-Net backbone with a skip-connected
Transformer that operates on latent features. SoloAudio supports both
audio-oriented and language-oriented TSE by utilizing a CLAP model as the
feature extractor for target sounds. Furthermore, SoloAudio leverages synthetic
audio generated by state-of-the-art text-to-audio models for training,
demonstrating strong generalization to out-of-domain data and unseen sound
events. We evaluate this approach on the FSD Kaggle 2018 mixture dataset and
real data from AudioSet, where SoloAudio achieves state-of-the-art results
on both in-domain and out-of-domain data and exhibits impressive zero-shot and
few-shot capabilities. Source code and demos are released.
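The backbone described above replaces a U-Net with a Transformer that keeps U-Net-style skip connections between its first and second halves, operating on latent features and conditioned on a CLAP embedding of the target sound. The following is a minimal NumPy sketch of that skip-connection pattern only; the block internals, dimensions, and the way the condition is injected are simplified stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def transformer_block(x, cond, w):
    # Stand-in for a full self-attention block: a conditioned MLP with a
    # residual connection. `cond` plays the role of the CLAP embedding.
    h = np.tanh(x @ w) + cond
    return x + h

def skip_connected_backbone(latents, cond, depth=6, dim=8):
    """U-Net-style skips across Transformer blocks on latent features.

    The first half of the blocks stores its activations; the second half
    fuses them back in (here by simple addition; a learned projection
    would be used in practice).
    """
    ws = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(depth)]
    skips, h = [], latents
    for i in range(depth // 2):                 # "encoder" half: save skips
        h = transformer_block(h, cond, ws[i])
        skips.append(h)
    for i in range(depth // 2, depth):          # "decoder" half: fuse skips
        h = transformer_block(h + skips.pop(), cond, ws[i])
    return h

# Toy latent sequence (frames x dim) and a CLAP-like condition vector.
lat = rng.standard_normal((4, 8))
cond = rng.standard_normal(8) * 0.1
out = skip_connected_backbone(lat, cond)
print(out.shape)  # (4, 8)
```

Because CLAP embeds audio and text into a shared space, the same `cond` slot can hold either an audio-clip embedding or a text-prompt embedding, which is what lets one model serve both audio-oriented and language-oriented TSE.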