AudioSlots: A slot-centric generative model for audio separation
May 9, 2023
Authors: Pradyumna Reddy, Scott Wisdom, Klaus Greff, John R. Hershey, Thomas Kipf
cs.AI
Abstract
In a range of recent works, object-centric architectures have been shown to
be suitable for unsupervised scene decomposition in the vision domain. Inspired
by these methods, we present AudioSlots, a slot-centric generative model for
blind source separation in the audio domain. AudioSlots is built from
permutation-equivariant encoder and decoder networks. The encoder network,
based on the Transformer architecture, learns to map a mixed audio spectrogram
to an unordered set of independent source embeddings. The spatial broadcast
decoder network learns to generate the source spectrograms from these source
embeddings. We train the model end-to-end using a permutation-invariant loss
function. Our results on Libri2Mix speech separation constitute a proof of
concept that this approach is promising. We discuss the results and
limitations of our approach in detail, and outline potential ways to overcome
those limitations as well as directions for future work.
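
To make the described architecture concrete, below is a minimal PyTorch sketch of a slot-centric separator in the spirit of the abstract: a Transformer encoder maps the mixture spectrogram to an unordered set of source embeddings ("slots"), and a spatial broadcast decoder renders one source spectrogram per slot. This is not the authors' implementation; the number of slots, layer widths, the use of learned slot queries with cross-attention pooling, and the spectrogram shape (`freq_bins`, `frames`) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioSlots(nn.Module):
    """Sketch of a slot-centric separator: Transformer encoder -> set of
    source embeddings (slots) -> spatial broadcast decoder per slot."""

    def __init__(self, n_slots=2, slot_dim=128, freq_bins=257, frames=100):
        super().__init__()
        self.n_slots, self.freq_bins, self.frames = n_slots, freq_bins, frames
        # Per-frame projection of the mixture spectrogram into the model width.
        self.embed = nn.Linear(freq_bins, slot_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=slot_dim, nhead=4,
                                       batch_first=True),
            num_layers=4,
        )
        # Learned slot queries, pooled from the encoded frames by
        # cross-attention (an assumption; other set-prediction heads work too).
        self.slot_queries = nn.Parameter(torch.randn(n_slots, slot_dim))
        self.pool = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=slot_dim, nhead=4,
                                       batch_first=True),
            num_layers=2,
        )
        # Spatial broadcast decoder: tile each slot over a (freq, time) grid,
        # concatenate a learned positional code, and map to one spectrogram.
        self.pos = nn.Parameter(torch.randn(freq_bins, frames, 2))
        self.decode = nn.Sequential(
            nn.Linear(slot_dim + 2, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, mixture):             # mixture: (B, frames, freq_bins)
        tokens = self.embed(mixture)        # (B, frames, slot_dim)
        memory = self.encoder(tokens)
        queries = self.slot_queries.expand(mixture.shape[0], -1, -1)
        slots = self.pool(queries, memory)  # (B, n_slots, slot_dim)
        # Broadcast every slot over the full spectrogram grid.
        grid = slots[:, :, None, None, :].expand(
            -1, -1, self.freq_bins, self.frames, -1)
        pos = self.pos.expand(mixture.shape[0], self.n_slots, -1, -1, -1)
        sources = self.decode(torch.cat([grid, pos], dim=-1)).squeeze(-1)
        return sources                      # (B, n_slots, freq_bins, frames)
```

Because the decoder is applied to each slot independently and the loss (below) is matched over permutations, the model as a whole is equivariant to the ordering of the slots.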
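
The permutation-invariant training objective can likewise be sketched: since the slots are unordered, each predicted source is matched to the ground-truth source under the assignment that minimizes total reconstruction error. The snippet below uses Hungarian matching via `scipy.optimize.linear_sum_assignment` with an MSE cost; the exact loss and matching procedure in the paper may differ.

```python
import torch
from scipy.optimize import linear_sum_assignment

def permutation_invariant_mse(pred, target):
    """Permutation-invariant MSE over sets of source spectrograms.

    pred, target: (B, n_slots, freq_bins, frames)
    """
    batch_losses = []
    for p, t in zip(pred, target):
        n = p.shape[0]
        # Pairwise MSE between every (prediction, target) pair.
        cost = torch.stack(
            [torch.stack([((p[i] - t[j]) ** 2).mean() for j in range(n)])
             for i in range(n)]
        )                                   # (n, n)
        # The assignment is computed on detached costs; gradients flow only
        # through the matched entries, as is standard for PIT-style losses.
        rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
        rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
        batch_losses.append(cost[rows, cols].sum() / n)
    return torch.stack(batch_losses).mean()

# Usage with the sketch above (random tensors stand in for real data):
model = AudioSlots()
mix = torch.randn(4, 100, 257)              # batch of mixture spectrograms
est = model(mix)                            # (4, 2, 257, 100)
loss = permutation_invariant_mse(est, torch.randn_like(est))
loss.backward()
```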