Prompting Large Language Models with Speech Recognition Abilities
July 21, 2023
Authors: Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
cs.AI
Abstract
Large language models have proven themselves highly flexible, able to solve a
wide range of generative tasks, such as abstractive summarization and
open-ended question answering. In this paper we extend the capabilities of LLMs
by directly attaching a small audio encoder, allowing them to perform speech
recognition. By directly prepending a sequence of audio embeddings to the text
token embeddings, the LLM can be converted into an automatic speech recognition
(ASR) system and used in exactly the same manner as its textual counterpart.
Experiments on Multilingual LibriSpeech (MLS) show that incorporating a
conformer encoder into the open-source LLaMA-7B allows it to outperform
monolingual baselines by 18% and perform multilingual speech recognition,
despite LLaMA being trained overwhelmingly on English text. Furthermore, we
perform ablation studies to investigate the effects of completely freezing the
LLM during training to maintain its original capabilities, of scaling up the
audio encoder, and of increasing the audio encoder stride to generate fewer
embeddings. The results of these studies show that multilingual ASR is
possible even when the LLM is frozen, or when strides of almost 1 second are
used in the audio encoder, opening up the possibility of applying LLMs to
long-form audio.
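The core mechanism described in the abstract can be sketched in a few lines: an audio encoder downsamples the acoustic features (here via a strided convolution standing in for the paper's conformer) and projects them into the LLM's embedding space, and the resulting audio embeddings are concatenated in front of the text token embeddings. This is a minimal illustrative sketch, not the authors' code; all module sizes, vocabulary size, and names (`AudioFrontEnd`, `feat_dim`, `llm_dim`) are hypothetical.

```python
import torch
import torch.nn as nn

class AudioFrontEnd(nn.Module):
    """Hypothetical stand-in for the conformer encoder: a strided
    convolution downsamples the audio features in time, and its output
    channels project directly into the LLM's embedding dimension."""
    def __init__(self, feat_dim=80, llm_dim=4096, stride=8):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, llm_dim, kernel_size=stride, stride=stride)

    def forward(self, feats):                      # feats: (batch, time, feat_dim)
        x = self.conv(feats.transpose(1, 2))       # (batch, llm_dim, time // stride)
        return x.transpose(1, 2)                   # (batch, time // stride, llm_dim)

batch, time, feat_dim, llm_dim = 2, 64, 80, 4096
audio_feats = torch.randn(batch, time, feat_dim)   # e.g. log-mel filterbanks
text_tokens = torch.randint(0, 32000, (batch, 10)) # prompt / transcript tokens

encoder = AudioFrontEnd(feat_dim, llm_dim, stride=8)
embed = nn.Embedding(32000, llm_dim)               # stands in for LLaMA's embedding table

audio_emb = encoder(audio_feats)                   # (2, 8, 4096)
text_emb = embed(text_tokens)                      # (2, 10, 4096)

# The key step: prepend the audio embeddings to the text token embeddings
# and feed the joint sequence to the (possibly frozen) LLM decoder.
inputs = torch.cat([audio_emb, text_emb], dim=1)
print(inputs.shape)                                # torch.Size([2, 18, 4096])
```

A larger convolution stride yields fewer audio embeddings per second, which is the knob the ablation studies vary: with a stride covering almost 1 second of audio, the prepended sequence stays short enough for the LLM to handle long-form input.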