Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer

Abstract

Domain adaptation using a text-only corpus is challenging in end-to-end (E2E) speech recognition. Adaptation by synthesizing audio from text through text-to-speech (TTS) is resource-consuming. We present a method to learn a Unified Speech-Text Representation in Conformer Transducer (USTR-CT) that enables fast domain adaptation using a text-only corpus. Unlike the previous textogram method, our method introduces an extra text encoder to learn the text representation and removes it during inference, so online deployment requires no modification. To improve adaptation efficiency, we also explore single-step and multi-step adaptation. Experiments on adapting from LibriSpeech to SPGISpeech show that the proposed method reduces the word error rate (WER) on the target domain by 44% relative, outperforming both the TTS method and the textogram method. We also show that the proposed method can be combined with internal language model estimation (ILME) to further improve performance.
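
The relative WER reduction quoted above is measured against the baseline WER rather than as an absolute difference in percentage points. A minimal sketch of that calculation follows; the function name and the numeric values are hypothetical and chosen only for illustration, not taken from the reported experiments.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical WER values (in percent), for illustration only;
# these are NOT results from the experiments described above.
baseline = 25.0
adapted = 14.0

reduction = relative_wer_reduction(baseline, adapted)
print(f"{reduction:.0%}")  # prints 44%
```

In this illustration the absolute reduction is 11 percentage points, while the relative reduction is 44%.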
