AudioSlots: A slot-centric generative model for audio separation

Abstract

In a range of recent works, object-centric architectures have been shown to be suitable for unsupervised scene decomposition in the vision domain. Inspired by these methods, we present AudioSlots, a slot-centric generative model for blind source separation in the audio domain. AudioSlots is built using permutation-equivariant encoder and decoder networks. The encoder network, based on the Transformer architecture, learns to map a mixed audio spectrogram to an unordered set of independent source embeddings. The spatial broadcast decoder network learns to generate the source spectrograms from the source embeddings. We train the model end-to-end using a permutation-invariant loss function. Our results on Libri2Mix speech separation provide a proof of concept that the approach is promising. We discuss the results and limitations of our approach in detail, and outline potential ways to overcome the limitations as well as directions for future work.
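To make the idea of a permutation-invariant loss concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes mean squared error between spectrograms as the per-pair cost and searches all slot orderings by brute force; the abstract does not specify either choice, and the function name `permutation_invariant_mse` is hypothetical. Because the predicted source embeddings are unordered, the loss is taken under the slot-to-source assignment that gives the lowest error.

```python
from itertools import permutations

import numpy as np


def permutation_invariant_mse(pred, target):
    """Return (loss, best_perm) for arrays of shape (n_sources, freq, time).

    best_perm[i] is the index of the predicted slot matched to target source i.
    Assumption: MSE is used as the pairwise cost (not stated in the abstract).
    """
    n_sources = target.shape[0]
    # pairwise[i, j] = MSE between target source i and predicted slot j
    diff = target[:, None] - pred[None, :]
    pairwise = np.mean(diff ** 2, axis=(2, 3))

    best_loss, best_perm = np.inf, None
    # Brute-force search over all assignments; fine for two or three sources.
    for perm in permutations(range(n_sources)):
        loss = pairwise[np.arange(n_sources), list(perm)].mean()
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return float(best_loss), best_perm


# Usage: predictions equal to the targets but in swapped slot order.
rng = np.random.default_rng(0)
target = rng.random((2, 4, 5))
pred = target[::-1]
loss, perm = permutation_invariant_mse(pred, target)
```

The brute-force search grows factorially with the number of sources. For larger slot counts, an assignment solver such as the Hungarian algorithm over the same `pairwise` cost matrix would be the usual substitute.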
