

Pengi: An Audio Language Model for Audio Tasks

May 19, 2023
Authors: Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang
cs.AI

Abstract

In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models capable of tackling a wide array of tasks while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the requisite language for open-ended tasks, such as Audio Captioning or Audio Question & Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes an audio recording and text as input, and generates free-form text as output. The input audio is represented as a sequence of continuous embeddings by an audio encoder. A text encoder does the same for the corresponding text input. Both sequences are combined as a prefix to prompt a pre-trained frozen language model. The unified architecture of Pengi enables both open-ended and closed-ended tasks without any additional fine-tuning or task-specific extensions. When evaluated on 22 downstream tasks, our approach yields state-of-the-art performance on several of them. Our results show that connecting language models with audio models is a major step towards general-purpose audio understanding.
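The prefix-conditioning idea described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the encoders are stand-in random projections with hypothetical dimensions, and it only shows how the audio-embedding sequence and text-embedding sequence would be concatenated into a prefix for a frozen language model.

```python
import numpy as np

def audio_encoder(audio, prefix_len=8, dim=16):
    # Stand-in for the audio encoder: map a waveform to a fixed-length
    # sequence of continuous embeddings (random projection for illustration).
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((audio.shape[-1], prefix_len * dim))
    return (audio @ proj).reshape(prefix_len, dim)

def text_encoder(token_ids, dim=16):
    # Stand-in for the text encoder: one embedding vector per input token.
    rng = np.random.default_rng(1)
    table = rng.standard_normal((1000, dim))
    return table[token_ids]

def build_prefix(audio, token_ids):
    # Concatenate the two embedding sequences; in Pengi this combined
    # prefix conditions a pre-trained, frozen language model's generation.
    audio_emb = audio_encoder(audio)
    text_emb = text_encoder(token_ids)
    return np.concatenate([audio_emb, text_emb], axis=0)

audio = np.random.default_rng(2).standard_normal(100)  # toy 100-sample clip
tokens = np.array([5, 17, 42])                         # toy task-prompt token ids
prefix = build_prefix(audio, tokens)
print(prefix.shape)  # 8 audio embeddings + 3 text embeddings -> (11, 16)
```

Because the language model is frozen and only the prefix varies, the same architecture serves captioning, question answering, and classification by changing the text prompt rather than the model.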