AudioSlots: A slot-centric generative model for audio separation
May 9, 2023
Authors: Pradyumna Reddy, Scott Wisdom, Klaus Greff, John R. Hershey, Thomas Kipf
cs.AI
Abstract
In a range of recent works, object-centric architectures have been shown to be suitable for unsupervised scene decomposition in the vision domain. Inspired by these methods, we present AudioSlots, a slot-centric generative model for blind source separation in the audio domain. AudioSlots is built from permutation-equivariant encoder and decoder networks. The encoder network, based on the Transformer architecture, learns to map a mixed audio spectrogram to an unordered set of independent source embeddings. The spatial broadcast decoder network learns to generate the source spectrograms from these embeddings. We train the model end to end using a permutation-invariant loss function. Our results on Libri2Mix speech separation provide a proof of concept that this approach is promising. We discuss the results and limitations of our approach in detail, and outline potential ways to overcome these limitations along with directions for future work.
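As a rough illustration of the permutation-invariant objective mentioned in the abstract (a minimal sketch, not the authors' implementation), the snippet below matches predicted slot spectrograms to ground-truth sources with the Hungarian algorithm and averages the error of the best assignment. The function name, the MSE cost, and the array shapes are assumptions made for illustration only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def permutation_invariant_mse(pred, target):
    """Permutation-invariant MSE between predicted and ground-truth sources.

    pred, target: arrays of shape (num_sources, freq_bins, time_frames)
    holding magnitude spectrograms (shapes are illustrative assumptions).
    The assignment of predicted slots to ground-truth sources is chosen
    to minimize the total reconstruction error.
    """
    # Pairwise cost matrix: cost[i, j] = MSE(predicted slot i, target source j).
    cost = np.array([[np.mean((p - t) ** 2) for t in target] for p in pred])
    # Hungarian algorithm: minimum-cost one-to-one matching of slots to sources.
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

# Example with two sources: the loss ignores the ordering of the slots.
rng = np.random.default_rng(0)
targets = rng.random((2, 257, 100))
preds = targets[::-1]  # same sources, swapped order
print(permutation_invariant_mse(preds, targets))  # ~0.0
```

Because the loss is minimized over all slot-to-source assignments, the encoder is free to emit the source embeddings in any order, which is what makes the unordered set representation trainable end to end.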