End-to-End Speech Recognition Contextualization with Large Language Models

Abstract

In recent years, large language models (LLMs) have garnered significant attention from the research community due to their exceptional performance and generalization capabilities. In this paper, we introduce a novel method for contextualizing speech recognition models that incorporate LLMs. Our approach casts speech recognition as a mixed-modal language modeling task based on a pretrained LLM. We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion. As a result, the system is implicitly incentivized to learn how to leverage unstructured contextual information during training. Our empirical results demonstrate a significant improvement in performance, with a 6% reduction in word error rate (WER) when additional textual context is provided. Moreover, we find that our method performs competitively, improving by 7.5% WER overall and 17% WER on rare words against a baseline contextualized RNN-T system that was trained on a speech dataset more than twenty-five times larger. Overall, we demonstrate that by adding only a handful of trainable parameters via adapters, we can unlock contextualized speech recognition capability for the pretrained LLM while keeping the same text-only input functionality.
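
The results above are reported in terms of WER. As a reminder of what that metric measures, the following is a minimal sketch of the standard definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the scoring tool or text normalization actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: splits on whitespace and applies no text
    normalization, which real evaluation pipelines typically add.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, inserted words can push WER above 1.0, which is expected behavior for this metric.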
