
Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer

June 7, 2023
Authors: Lu Huang, Boyu Li, Jun Zhang, Lu Lu, Zejun Ma
cs.AI

Abstract

Domain adaptation using a text-only corpus is challenging in end-to-end (E2E) speech recognition. Adaptation by synthesizing audio from text through TTS is resource-consuming. We present a method to learn a Unified Speech-Text Representation in a Conformer Transducer (USTR-CT) that enables fast domain adaptation using a text-only corpus. Different from the previous textogram method, our work introduces an extra text encoder to learn the text representation; it is removed during inference, so no modification is needed for online deployment. To improve the efficiency of adaptation, single-step and multi-step adaptation are also explored. Experiments on adapting LibriSpeech to SPGISpeech show that the proposed method reduces the word error rate (WER) on the target domain by a relative 44%, outperforming both the TTS and textogram methods. It is also shown that the proposed method can be combined with internal language model estimation (ILME) to further improve performance.
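
The core architectural idea in the abstract is that a text-only branch shares the transducer's prediction and joint networks with the speech branch during adaptation, and is dropped at inference. Below is a minimal, hedged sketch of that idea in PyTorch; it is not the authors' code, and all module sizes, layer choices (a plain Transformer stands in for the Conformer), and names are assumptions made for illustration.

```python
# Sketch of the "extra text encoder" idea: text is mapped into the same
# representation space as the speech encoder output so the transducer's
# prediction/joint networks can be adapted on text alone; the text encoder
# is unused (removable) at inference. All sizes and names are assumptions.

import torch
import torch.nn as nn


class USTRTransducerSketch(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=80, enc_dim=256):
        super().__init__()
        # Speech encoder (stand-in for a Conformer encoder).
        self.speech_encoder = nn.Sequential(
            nn.Linear(feat_dim, enc_dim),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(enc_dim, nhead=4, batch_first=True),
                num_layers=2,
            ),
        )
        # Auxiliary text encoder: used only during text-only adaptation,
        # removed at inference, so online deployment is unchanged.
        self.text_encoder = nn.Sequential(
            nn.Embedding(vocab_size, enc_dim),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(enc_dim, nhead=4, batch_first=True),
                num_layers=2,
            ),
        )
        # Transducer prediction and joint networks, shared by both branches.
        self.pred_embed = nn.Embedding(vocab_size, enc_dim)
        self.predictor = nn.LSTM(enc_dim, enc_dim, batch_first=True)
        self.joint = nn.Sequential(nn.Tanh(), nn.Linear(enc_dim, vocab_size))

    def encode(self, speech=None, text=None):
        # Route either modality into the shared representation space.
        if speech is not None:
            return self.speech_encoder(speech)
        return self.text_encoder(text)

    def forward(self, labels, speech=None, text=None):
        enc = self.encode(speech=speech, text=text)          # (B, T, D)
        pred, _ = self.predictor(self.pred_embed(labels))     # (B, U, D)
        # Standard transducer joint over all (t, u) pairs.
        return self.joint(enc.unsqueeze(2) + pred.unsqueeze(1))  # (B, T, U, V)


model = USTRTransducerSketch()
labels = torch.randint(0, 1000, (2, 5))
# Text-only adaptation step: target-domain text goes through the text encoder.
logits_text = model(labels, text=torch.randint(0, 1000, (2, 12)))
# Speech/inference path: only the speech encoder is used.
logits_speech = model(labels, speech=torch.randn(2, 30, 80))
print(logits_text.shape, logits_speech.shape)
```

In a sketch like this, the single-step vs. multi-step adaptation mentioned in the abstract would correspond to how the text-only updates are scheduled (jointly with or after speech training); those training-loop details are not shown here.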