Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

Abstract

Language models (LMs) have long been used to improve the results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors, but they have shown little improvement over traditional LMs, mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), a scaled error correction model trained with vast amounts of synthetic data, which significantly exceeds prior attempts while achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, feed it into an ASR system to produce noisy hypotheses, and pair these hypotheses with the original texts to train the DLM. DLM has several key ingredients: (i) up-scaled model and data; (ii) usage of multi-speaker TTS systems; (iii) combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on test-clean and 3.3% WER on test-other on Librispeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used, and which even match self-supervised methods that use external audio data. Furthermore, a single DLM is applicable to different ASRs and greatly surpasses the performance of conventional LM-based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems.
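The data-generation step described in the abstract (text, then TTS audio, then ASR hypothesis, paired with the original text) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `build_dlm_pairs`, `toy_tts`, `toy_asr`, and the speaker identifiers are hypothetical stand-ins, not the TTS or ASR systems used in the paper, and the noise augmentation and decoding techniques are not modeled. The `word_error_rate` helper is the standard word-level edit distance divided by the reference length, included only to show how noisy a hypothesis is relative to its source text.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces: a TTS maps (text, speaker) to audio bytes,
# and an ASR maps audio bytes to a text hypothesis.
TTS = Callable[[str, str], bytes]
ASR = Callable[[bytes], str]


def build_dlm_pairs(
    texts: Sequence[str], tts: TTS, asr: ASR, speakers: Sequence[str]
) -> List[Tuple[str, str]]:
    """Return (noisy_hypothesis, clean_text) pairs for error-correction training."""
    pairs = []
    for text in texts:
        for speaker in speakers:  # several speakers per text (multi-speaker TTS)
            audio = tts(text, speaker)
            hypothesis = asr(audio)
            pairs.append((hypothesis, text))
    return pairs


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Toy stand-ins so the sketch runs without any speech models (assumptions).
def toy_tts(text: str, speaker: str) -> bytes:
    return f"{speaker}|{text}".encode("utf-8")


def toy_asr(audio: bytes) -> str:
    text = audio.decode("utf-8").split("|", 1)[1]
    return text.replace("their", "there")  # simulate a homophone error
```

With the toy stand-ins, `build_dlm_pairs(["their cat sat"], toy_tts, toy_asr, ["spk0", "spk1"])` yields two pairs whose input side is the corrupted hypothesis and whose target side is the original text. A real setup would swap in actual TTS and ASR models behind the same two callables.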
