
Prompting Large Language Models with Speech Recognition Abilities

July 21, 2023
作者: Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
cs.AI

Abstract

Large language models have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper, we extend the capabilities of LLMs by directly attaching a small audio encoder, allowing them to perform speech recognition. By directly prepending a sequence of audio embeddings to the text token embeddings, the LLM can be converted into an automatic speech recognition (ASR) system and used in exactly the same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a conformer encoder into the open-source LLaMA-7B allows it to outperform monolingual baselines by 18% and perform multilingual speech recognition, despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies to investigate whether the LLM can be completely frozen during training to maintain its original capabilities, the effect of scaling up the audio encoder, and the effect of increasing the audio encoder's stride to generate fewer embeddings. The results from these studies show that multilingual ASR is possible even when the LLM is frozen or when strides of almost 1 second are used in the audio encoder, opening up the possibility for LLMs to operate on long-form audio.
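The core mechanism described above, prepending the audio encoder's output embeddings to the text token embeddings before the LLM, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions are illustrative (LLaMA-7B's hidden size is 4096, but the frame and token counts are made up), and random arrays stand in for the conformer encoder output and the LLM's token embedding lookup.

```python
import numpy as np

# Illustrative dimensions (only d_model reflects LLaMA-7B; the rest are arbitrary).
d_model = 4096
num_audio_frames = 12   # conformer outputs after downsampling/striding (hypothetical)
num_text_tokens = 5     # tokens in the text prompt (hypothetical)

# Stand-ins for the conformer encoder output and the LLM's token embeddings.
rng = np.random.default_rng(0)
audio_embeddings = rng.standard_normal((num_audio_frames, d_model))
text_embeddings = rng.standard_normal((num_text_tokens, d_model))

# The key step: prepend the audio embeddings to the text token embeddings.
# The combined sequence is then fed to the (possibly frozen) LLM as if it
# were an ordinary sequence of token embeddings.
llm_input = np.concatenate([audio_embeddings, text_embeddings], axis=0)

print(llm_input.shape)  # (17, 4096): audio frames followed by text tokens
```

Because the audio frames simply occupy the leading positions of the input sequence, the LLM's decoding interface is unchanged; a larger encoder stride reduces `num_audio_frames`, which is what makes long-form audio feasible.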