Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition
May 24, 2024
Authors: Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly
cs.AI
Abstract
Language models (LMs) have long been used to improve results of automatic
speech recognition (ASR) systems, but they are unaware of the errors that ASR
systems make. Error correction models are designed to fix ASR errors; however,
they showed little improvement over traditional LMs mainly due to the lack of
supervised training data. In this paper, we present Denoising LM (DLM), which
is a scaled error correction model trained with vast amounts of
synthetic data, significantly exceeding prior attempts while achieving new
state-of-the-art ASR performance. We use text-to-speech (TTS) systems to
synthesize audio, which is fed into an ASR system to produce noisy hypotheses,
which are then paired with the original texts to train the DLM. DLM has several
key ingredients: (i) up-scaled model and data; (ii) usage of
multi-speaker TTS systems; (iii) combination of multiple noise augmentation
strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM
achieves 1.5% word error rate (WER) on test-clean and 3.3% WER on
test-other on Librispeech, which to our knowledge are the best
reported numbers in the setting where no external audio data are used and even
match self-supervised methods that use external audio data. Furthermore, a
single DLM is applicable to different ASRs and greatly surpasses the
performance of conventional LM-based beam-search rescoring. These results
indicate that properly investigated error correction models have the potential
to replace conventional LMs, holding the key to a new level of accuracy in ASR
systems.
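
The data pipeline described in the abstract (text to TTS audio, audio to ASR hypothesis, hypothesis paired with the original text) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `build_dlm_pairs`, `toy_tts`, and `toy_asr` are hypothetical stand-ins, not the authors' code, and the toy "ASR" simply swaps a few homophones so that the example runs without any speech models. The `word_error_rate` helper is the standard word-level edit distance divided by the reference length, included only to show how the noise in a pair would be measured.

```python
from typing import Callable, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,         # insertion: extra word in hypothesis
                prev[j - 1] + (r != h),  # substitution, or a free match
            ))
        prev = curr
    return prev[-1] / len(ref)


def build_dlm_pairs(
    texts: List[str],
    tts: Callable[[str], List[str]],
    asr: Callable[[List[str]], str],
) -> List[Tuple[str, str]]:
    """Return (noisy ASR hypothesis, original text) pairs for error-correction training."""
    pairs = []
    for text in texts:
        audio = tts(text)        # synthesize audio from text
        hypothesis = asr(audio)  # transcribe it, errors included
        pairs.append((hypothesis, text))
    return pairs


# Hypothetical stand-ins: the "audio" is just the word list, and the "ASR"
# introduces homophone confusions. Real systems would replace both.
def toy_tts(text: str) -> List[str]:
    return text.split()


def toy_asr(audio: List[str]) -> str:
    confusions = {"their": "there", "sea": "see"}
    return " ".join(confusions.get(word, word) for word in audio)


if __name__ == "__main__":
    for noisy, clean in build_dlm_pairs(["their boat is at sea"], toy_tts, toy_asr):
        print(noisy, "->", clean, "| WER:", word_error_rate(clean, noisy))
```

In this sketch the correction model would be trained to map the first element of each pair to the second. The TTS and ASR components are passed in as callables so that, under the same assumptions, different synthesizers or recognizers could be swapped in without changing the pairing logic.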