Large-scale Language Model Rescoring on Long-form Data

Abstract

In this work, we study the impact of Large-scale Language Models (LLMs) on Automated Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form ASR. We demonstrate up to 8% relative reduction in Word Error Rate (WER) on US English (en-us) and code-switched Indian English (en-in) long-form ASR test sets, and a reduction of up to 30% relative on Salient Term Error Rate (STER), over a strong first-pass baseline that uses a maximum-entropy based language model. Improved lattice processing, which produces a lattice with a proper (non-tree) digraph topology and carries context from the 1-best hypothesis of the previous segment(s), yields significant wins in rescoring with LLMs. We also find that the gains in performance from combining LLMs trained on vast quantities of available data (such as C4) with conventional neural LMs are additive, and that the combination significantly outperforms a strong first-pass baseline with a maximum-entropy LM.
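The headline numbers above are relative reductions in word error rate. As a minimal sketch of how such a figure is typically computed (the function names and the example values here are illustrative assumptions, not taken from the paper), WER can be obtained from a word-level edit distance and then compared between a baseline and a rescored system:

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                            # deletion
                curr[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),   # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by the reference length."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_errors(reference, hypothesis) / ref_len


def relative_reduction(baseline: float, rescored: float) -> float:
    """Relative error reduction of a rescored system over a baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - rescored) / baseline
```

Under these assumptions, a hypothetical baseline WER of 10.0% that drops to 9.2% after rescoring corresponds to `relative_reduction(0.10, 0.092)`, an 8% relative reduction. The same ratio would apply to STER if errors were counted only over salient terms.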
