Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition
May 24, 2024
Authors: Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly
cs.AI
Abstract
Language models (LMs) have long been used to improve results of automatic
speech recognition (ASR) systems, but they are unaware of the errors that ASR
systems make. Error correction models are designed to fix ASR errors; however,
they have shown little improvement over traditional LMs, mainly due to the lack of
supervised training data. In this paper, we present Denoising LM (DLM), which
is a scaled error correction model trained with vast amounts of
synthetic data, significantly exceeding prior attempts while achieving new
state-of-the-art ASR performance. We use text-to-speech (TTS) systems to
synthesize audio, which is fed into an ASR system to produce noisy hypotheses,
which are then paired with the original texts to train the DLM. DLM has several
key ingredients: (i) up-scaled model and data; (ii) usage of
multi-speaker TTS systems; (iii) combination of multiple noise augmentation
strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM
achieves 1.5% word error rate (WER) on test-clean and 3.3% WER on
test-other on Librispeech, which to our knowledge are the best
reported numbers in the setting where no external audio data are used and even
match self-supervised methods that use external audio data. Furthermore, a
single DLM is applicable to different ASR systems and greatly surpasses the
performance of conventional LM-based beam-search rescoring. These results
indicate that properly investigated error correction models have the potential
to replace conventional LMs, holding the key to a new level of accuracy in ASR
systems.
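
The data-generation loop described in the abstract (text, then TTS audio, then ASR hypothesis, paired with the original text) and the WER metric used to report results can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_dlm_pairs`, `word_error_rate`, `toy_tts`, and `toy_asr` are hypothetical names, and the toy TTS/ASR functions are stand-ins for real systems.

```python
from typing import Callable, Iterable, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference word missing from hypothesis
                curr[j - 1] + 1,         # extra word in hypothesis
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def build_dlm_pairs(
    texts: Iterable[str],
    synthesize: Callable[[str], bytes],
    transcribe: Callable[[bytes], str],
) -> List[Tuple[str, str]]:
    """Return (noisy hypothesis, clean text) pairs for training a corrector."""
    pairs = []
    for text in texts:
        audio = synthesize(text)        # TTS step (any system; assumed interface)
        hypothesis = transcribe(audio)  # ASR step producing a noisy hypothesis
        pairs.append((hypothesis, text))
    return pairs


# Hypothetical stand-ins so the sketch runs without real TTS/ASR models.
def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")  # bytes play the role of synthesized audio


def toy_asr(audio: bytes) -> str:
    # Simulate a homophone error of the kind an ASR system might make.
    return audio.decode("utf-8").replace("their", "there")
```

In this sketch the correction model would be trained to map the first element of each pair to the second; the noise augmentation and decoding techniques mentioned in the abstract are not modeled here.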