End-to-End Speech Recognition Contextualization with Large Language Models
September 19, 2023
Authors: Egor Lakomkin, Chunyang Wu, Yassir Fathullah, Ozlem Kalinli, Michael L. Seltzer, Christian Fuegen
cs.AI
Abstract
In recent years, Large Language Models (LLMs) have garnered significant
attention from the research community due to their exceptional performance and
generalization capabilities. In this paper, we introduce a novel method for
contextualizing speech recognition models incorporating LLMs. Our approach
casts speech recognition as a mixed-modal language modeling task based on a
pretrained LLM. We provide audio features, along with optional text tokens for
context, to train the system to complete transcriptions in a decoder-only
fashion. As a result, the system is implicitly incentivized to learn how to
leverage unstructured contextual information during training. Our empirical
results demonstrate a significant improvement in performance, with a 6% WER
reduction when additional textual context is provided. Moreover, we find that
our method performs competitively, improving WER by 7.5% overall and by 17%
on rare words over a baseline contextualized RNN-T system trained on a speech
dataset more than twenty-five times larger. Overall, we
demonstrate that by adding only a handful of trainable parameters via
adapters, we can unlock contextualized speech recognition capability for the
pretrained LLM while keeping the same text-only input functionality.
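The abstract's core idea — casting recognition as decoder-only language modeling over a mixed-modal prompt of optional text context, adapted audio features, and transcription tokens — can be sketched as follows. This is a minimal illustration with NumPy, not the paper's implementation; the dimensions, the linear `audio_adapter`, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper).
AUDIO_DIM, LLM_DIM = 80, 16

def audio_adapter(features, weights):
    """Project audio frames into the LLM embedding space.
    Stands in for the small trainable adapter the paper adds
    on top of a frozen pretrained LLM."""
    return features @ weights

def build_mixed_modal_sequence(context_emb, audio_emb, target_emb):
    """Concatenate optional text context, audio features, and the
    transcription tokens into one decoder-only input sequence."""
    parts = [p for p in (context_emb, audio_emb, target_emb) if p is not None]
    return np.concatenate(parts, axis=0)

# Toy inputs: 50 audio frames, 8 context tokens, 12 transcription tokens.
adapter_w = rng.normal(size=(AUDIO_DIM, LLM_DIM)) * 0.01
audio = audio_adapter(rng.normal(size=(50, AUDIO_DIM)), adapter_w)
context = rng.normal(size=(8, LLM_DIM))
target = rng.normal(size=(12, LLM_DIM))

seq = build_mixed_modal_sequence(context, audio, target)

# Training loss is computed only on the transcription positions;
# context and audio act purely as the prompt, which is what implicitly
# incentivizes the model to exploit the unstructured context.
loss_mask = np.concatenate([np.zeros(len(context) + len(audio)),
                            np.ones(len(target))])
```

At inference, dropping `context_emb` (passing `None`) reduces the prompt to audio alone, which mirrors the claim that text-only and context-free usage remain available in the same model.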