Prompting Large Language Models with Speech Recognition Abilities

Abstract

Large language models (LLMs) have proven highly flexible, solving a wide range of generative tasks such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder, allowing them to perform speech recognition. By directly prepending a sequence of audio embeddings to the text token embeddings, the LLM can be converted into an automatic speech recognition (ASR) system and used in exactly the same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a Conformer encoder into the open-source LLaMA-7B allows it to outperform monolingual baselines by 18% and to perform multilingual speech recognition, despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies to investigate three questions: whether the LLM can be completely frozen during training to maintain its original capabilities, what effect scaling up the audio encoder has, and what effect increasing the audio encoder stride to generate fewer embeddings has. The results of these studies show that multilingual ASR is possible even when the LLM is frozen or when strides of almost 1 second are used in the audio encoder, opening up the possibility for LLMs to operate on long-form audio.
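The prepending step and the effect of a larger encoder stride can be illustrated with a minimal sketch. This is not the authors' implementation: the Conformer encoder is replaced here by simple frame stacking followed by a random linear projection, and all names (`stack_frames`, `project`, `build_llm_input`, `count_embeddings`), dimensions, and stride values are hypothetical choices made only for illustration.

```python
import random


def stack_frames(frames, factor):
    """Concatenate every `factor` consecutive frames into one vector.

    A crude stand-in for encoder striding (assumption, not the paper's
    Conformer): the output is roughly `factor` times shorter. The last
    group is zero-padded if it is incomplete.
    """
    dim = len(frames[0])
    stacked = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        while len(group) < factor:
            group = group + [[0.0] * dim]
        stacked.append([x for frame in group for x in frame])
    return stacked


def project(vectors, weight):
    """Linear projection; `weight` has one row per output dimension."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight]
            for vec in vectors]


def build_llm_input(audio_embeddings, text_embeddings):
    """Prepend audio embeddings to the text token embeddings."""
    return audio_embeddings + text_embeddings


def count_embeddings(duration_ms, stride_ms):
    """Number of audio embeddings for a clip, using ceiling division."""
    return -(-duration_ms // stride_ms)


random.seed(0)
frame_dim, llm_dim, factor = 4, 6, 4  # hypothetical sizes

# Ten fake acoustic feature frames.
frames = [[random.random() for _ in range(frame_dim)] for _ in range(10)]
stacked = stack_frames(frames, factor)

# Random projection into the (hypothetical) LLM embedding dimension.
weight = [[random.gauss(0.0, 0.1) for _ in range(frame_dim * factor)]
          for _ in range(llm_dim)]
audio_embeddings = project(stacked, weight)

# Five fake text token embeddings, e.g. a prompt or partial transcript.
text_embeddings = [[0.0] * llm_dim for _ in range(5)]

llm_input = build_llm_input(audio_embeddings, text_embeddings)
print(len(audio_embeddings), len(llm_input))

# Illustrative stride values only: a longer stride means fewer embeddings.
for stride_ms in (80, 960):
    print(stride_ms, count_embeddings(10_000, stride_ms))
```

Under these assumed values, a 10-second clip yields far fewer embeddings at a stride close to 1 second than at a short stride, which is the property the abstract connects to long-form audio.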
