
End-to-End Speech Recognition Contextualization with Large Language Models

September 19, 2023
作者: Egor Lakomkin, Chunyang Wu, Yassir Fathullah, Ozlem Kalinli, Michael L. Seltzer, Christian Fuegen
cs.AI

Abstract

In recent years, Large Language Models (LLMs) have garnered significant attention from the research community due to their exceptional performance and generalization capabilities. In this paper, we introduce a novel method for contextualizing speech recognition models by incorporating LLMs. Our approach casts speech recognition as a mixed-modal language modeling task based on a pretrained LLM. We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion. As a result, the system is implicitly incentivized to learn how to leverage unstructured contextual information during training. Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided. Moreover, we find that our method performs competitively, improving WER by 7.5% overall and by 17% on rare words, against a baseline contextualized RNN-T system trained on a speech dataset more than twenty-five times larger. Overall, we demonstrate that by adding only a handful of trainable parameters via adapters, we can unlock contextualized speech recognition capability for the pretrained LLM while keeping the same text-only input functionality.
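The prompting scheme the abstract describes, optional text context followed by audio features, which a decoder-only LLM then completes as a transcription, can be sketched as below. This is a minimal illustration, not the paper's implementation: all names, dimensions, and the random "adapter" projection are assumptions for the sketch; in the actual system the adapter is trained and the LLM is a real pretrained decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8   # LLM embedding size (illustrative)
AUDIO_DIM = 4   # acoustic feature size (illustrative)
VOCAB = 100     # toy vocabulary

# Stand-in for the frozen, pretrained LLM's token embedding table.
token_embedding = rng.standard_normal((VOCAB, EMBED_DIM))

# Small adapter projecting audio features into the LLM's embedding
# space -- in the paper's scheme, roughly the only new trainable
# parameters added to the pretrained model.
adapter_W = rng.standard_normal((AUDIO_DIM, EMBED_DIM)) * 0.1

def build_prefix(context_ids, audio_features):
    """Concatenate embedded context tokens and adapter-projected audio
    frames into one mixed-modal prefix for the decoder-only LLM,
    which is then trained to complete the transcription after it."""
    ctx = token_embedding[context_ids]           # (T_ctx, EMBED_DIM)
    aud = audio_features @ adapter_W             # (T_audio, EMBED_DIM)
    return np.concatenate([ctx, aud], axis=0)    # (T_ctx + T_audio, EMBED_DIM)

context_ids = np.array([5, 17, 42])                    # optional text context
audio_features = rng.standard_normal((6, AUDIO_DIM))   # 6 audio frames

prefix = build_prefix(context_ids, audio_features)
print(prefix.shape)  # (9, 8): 3 context tokens + 6 audio frames
```

Because the context tokens pass through the unchanged embedding table, the same model still accepts text-only input when no audio is given, which is the "text-only input functionality" the abstract refers to.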