

Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer

June 7, 2023
Authors: Lu Huang, Boyu Li, Jun Zhang, Lu Lu, Zejun Ma
cs.AI

Abstract

Domain adaptation using a text-only corpus is challenging in end-to-end (E2E) speech recognition, and adaptation by synthesizing audio from text through TTS is resource-consuming. We present a method to learn a Unified Speech-Text Representation in a Conformer Transducer (USTR-CT) to enable fast domain adaptation using a text-only corpus. Different from the previous textogram method, our work introduces an extra text encoder to learn the text representation; it is removed during inference, so no modification is needed for online deployment. To improve the efficiency of adaptation, single-step and multi-step adaptation are also explored. Experiments on adapting LibriSpeech to SPGISpeech show that the proposed method reduces the word error rate (WER) on the target domain by 44% relative, outperforming both the TTS method and the textogram method. It is also shown that the proposed method can be combined with internal language model estimation (ILME) to further improve performance.
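The core mechanism in the abstract can be sketched as follows: during training and text-only adaptation, an auxiliary text encoder maps text into the same representation space as the speech encoder, so text-only data can drive the downstream network; at inference the text encoder is dropped, leaving the deployed transducer unchanged. The toy class below is a minimal illustrative sketch of that split — all names, dimensions, and the stand-in "encoders" are assumptions for illustration, not the paper's implementation.

```python
class USTRTransducer:
    """Toy model of a transducer with a unified speech-text representation.

    The text encoder exists only for training/text-only adaptation; inference
    uses the speech path alone, so deployment is untouched (hypothetical sketch).
    """

    DIM = 4  # size of the shared representation space (assumed)

    def __init__(self):
        # Auxiliary text encoder, present only during training/adaptation.
        self.text_encoder = self._text_encoder

    def _speech_encoder(self, frames):
        # Stand-in for the Conformer speech encoder: one DIM-vector per frame.
        return [[x * 0.1 + i for i in range(self.DIM)] for x in frames]

    def _text_encoder(self, tokens):
        # Maps token ids into the same DIM-dimensional space as speech frames.
        return [[t * 0.1 + i for i in range(self.DIM)] for t in tokens]

    def adapt_on_text(self, token_ids):
        # Text-only adaptation: text-encoder outputs feed the same downstream
        # network as the speech path (the actual loss is omitted in this toy).
        assert self.text_encoder is not None, "text encoder needed for adaptation"
        reps = self.text_encoder(token_ids)
        return len(reps)  # number of representations produced, for illustration

    def drop_text_encoder(self):
        # After adaptation the text branch is removed entirely.
        self.text_encoder = None

    def transcribe(self, frames):
        # Inference uses only the speech encoder; no text branch is involved.
        reps = self._speech_encoder(frames)
        return [max(range(self.DIM), key=lambda i: r[i]) for r in reps]


model = USTRTransducer()
model.adapt_on_text([1, 2, 3])   # adaptation requires the text encoder
model.drop_text_encoder()        # remove it before deployment
labels = model.transcribe([0.5, 1.0])  # inference still works unchanged
```

The point of the sketch is the last three calls: adaptation touches only the text branch, and removing that branch does not affect the inference path.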