
End-to-End Speech Recognition Contextualization with Large Language Models

September 19, 2023
作者: Egor Lakomkin, Chunyang Wu, Yassir Fathullah, Ozlem Kalinli, Michael L. Seltzer, Christian Fuegen
cs.AI

Abstract

In recent years, Large Language Models (LLMs) have garnered significant attention from the research community due to their exceptional performance and generalization capabilities. In this paper, we introduce a novel method for contextualizing speech recognition models by incorporating LLMs. Our approach casts speech recognition as a mixed-modal language modeling task based on a pretrained LLM. We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion. As a result, the system is implicitly incentivized to learn how to leverage unstructured contextual information during training. Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided. Moreover, we find that our method performs competitively, improving WER by 7.5% overall and by 17% on rare words, against a baseline contextualized RNN-T system trained on a speech dataset more than twenty-five times larger. Overall, we demonstrate that by adding only a handful of trainable parameters via adapters, we can unlock contextualized speech recognition capability for the pretrained LLM while keeping the same text-only input functionality.
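The prompting scheme the abstract describes, optional text context followed by audio features, which a decoder-only LLM then completes as a transcription, can be sketched as below. This is a minimal illustration, not the paper's implementation: all names, dimensions, and the random "adapter" projection are assumptions for the sketch; in the actual system the adapter is trained and the LLM is a real pretrained decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8   # LLM embedding size (illustrative)
AUDIO_DIM = 4   # acoustic feature size (illustrative)
VOCAB = 100     # toy vocabulary

# Stand-in for the frozen, pretrained LLM's token embedding table.
token_embedding = rng.standard_normal((VOCAB, EMBED_DIM))

# Small adapter projecting audio features into the LLM's embedding
# space -- in the paper's scheme, roughly the only new trainable
# parameters added to the pretrained model.
adapter_W = rng.standard_normal((AUDIO_DIM, EMBED_DIM)) * 0.1

def build_prefix(context_ids, audio_features):
    """Concatenate embedded context tokens and adapter-projected audio
    frames into one mixed-modal prefix for the decoder-only LLM,
    which is then trained to complete the transcription after it."""
    ctx = token_embedding[context_ids]           # (T_ctx, EMBED_DIM)
    aud = audio_features @ adapter_W             # (T_audio, EMBED_DIM)
    return np.concatenate([ctx, aud], axis=0)    # (T_ctx + T_audio, EMBED_DIM)

context_ids = np.array([5, 17, 42])                    # optional text context
audio_features = rng.standard_normal((6, AUDIO_DIM))   # 6 audio frames

prefix = build_prefix(context_ids, audio_features)
print(prefix.shape)  # (9, 8): 3 context tokens + 6 audio frames
```

Because the context tokens pass through the unchanged embedding table, the same model still accepts text-only input when no audio is given, which is the "text-only input functionality" the abstract refers to.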