Pengi: An Audio Language Model for Audio Tasks

Abstract

In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models that can tackle a wide array of tasks while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the language required for open-ended tasks, such as Audio Captioning or Audio Question & Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes an audio recording and text as input and generates free-form text as output. The input audio is represented as a sequence of continuous embeddings by an audio encoder. A text encoder does the same for the corresponding text input. Both sequences are combined as a prefix to prompt a pre-trained frozen language model. The unified architecture of Pengi supports both open-ended and close-ended tasks without any additional fine-tuning or task-specific extensions. When evaluated on 22 downstream tasks, our approach yields state-of-the-art performance in several of them. Our results show that connecting language models with audio models is a major step towards general-purpose audio understanding.
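The prefix construction described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the encoders below are trivial stand-ins, and `EMBED_DIM`, `encode_audio`, `encode_text`, `build_prefix`, the prompt string, and the audio-then-text ordering are all assumptions made for illustration. The sketch only shows the data flow, in which two embedding sequences are concatenated into one prefix that would then be fed to a frozen language model.

```python
from typing import List

Vector = List[float]
EMBED_DIM = 4  # hypothetical embedding width shared by both encoders


def encode_audio(samples: List[float]) -> List[Vector]:
    """Toy stand-in for an audio encoder: one embedding per frame of samples."""
    frames = [samples[i:i + EMBED_DIM] for i in range(0, len(samples), EMBED_DIM)]
    # Zero-pad the last frame so every embedding has EMBED_DIM entries.
    return [frame + [0.0] * (EMBED_DIM - len(frame)) for frame in frames]


def encode_text(prompt: str) -> List[Vector]:
    """Toy stand-in for a text encoder: one embedding per whitespace token."""
    return [[float(len(token))] * EMBED_DIM for token in prompt.split()]


def build_prefix(audio_emb: List[Vector], text_emb: List[Vector]) -> List[Vector]:
    """Concatenate both sequences along the sequence axis (audio first, by assumption)."""
    return audio_emb + text_emb


# Hypothetical input: 10 audio samples and a made-up task prompt.
audio = encode_audio([0.1] * 10)
text = encode_text("generate audio caption")
prefix = build_prefix(audio, text)
print(len(audio), len(text), len(prefix))
```

In a real system the prefix would be passed as input embeddings to a pre-trained language model whose weights are not updated; that step is omitted here because it depends on the specific model and library used.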
