Natural Language Supervision for General-Purpose Audio Representations
September 11, 2023
Authors: Benjamin Elizalde, Soham Deshmukh, Huaming Wang
cs.AI
Abstract
Audio-Language models jointly learn multimodal text and audio representations
that enable Zero-Shot inference. Models rely on the encoders to create powerful
representations of the input and generalize to multiple tasks spanning
sounds, music, and speech. Although models have achieved remarkable
performance, there is still a performance gap with task-specific models. In
this paper, we propose a Contrastive Language-Audio Pretraining model that is
pretrained with a diverse collection of 4.6M audio-text pairs employing two
innovative encoders for Zero-Shot inference. To learn audio representations, we
trained an audio encoder on 22 audio tasks, instead of the standard training of
sound event classification. To learn language representations, we trained an
autoregressive decoder-only model instead of the standard encoder-only models.
Then, the audio and language representations are brought into a joint
multimodal space using Contrastive Learning. Our encoders improved
downstream performance. We extensively evaluated the
generalization of our representations on 26 downstream tasks, the largest such
evaluation in the literature. Our model achieves state-of-the-art results in
several tasks, leading the way towards general-purpose audio representations.
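
The abstract's two central mechanisms, contrastive alignment of audio and text embeddings and Zero-Shot inference in the resulting joint space, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the temperature value of 0.07, and the random arrays standing in for encoder outputs are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch of paired embeddings.

    Row i of audio_emb and row i of text_emb are assumed to form a matching
    pair; every other row in the batch acts as a negative.
    The temperature is an assumed value, not one taken from the paper.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)


def zero_shot_classify(audio_emb, class_text_emb):
    """Assign each audio clip to the class prompt with the highest cosine similarity."""
    sims = l2_normalize(audio_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for encoder outputs: 4 pairs, 8-dimensional embeddings.
    emb = rng.normal(size=(4, 8))
    print("loss, aligned pairs:   ", symmetric_contrastive_loss(emb, emb))
    print("loss, misaligned pairs:", symmetric_contrastive_loss(emb, emb[::-1]))
    print("zero-shot predictions: ", zero_shot_classify(emb, emb))
```

In this sketch, Zero-Shot inference needs no task-specific training: class names are embedded as text prompts, and each audio clip is assigned to the nearest prompt in the joint space.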