WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

Abstract

Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
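
The headline result is a relative WER reduction, not an absolute one. As a minimal sketch of how such a figure is typically computed (the function names `word_error_rate` and `relative_reduction` and all example values are illustrative assumptions, not taken from the paper or its code), WER can be calculated as a word-level edit distance divided by the reference length, and the relative reduction as the drop in WER divided by the baseline WER:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by the adapted system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer
```

With hypothetical values, a baseline WER of 0.20 and an adapted WER of 0.15 give a relative reduction of about 25%, although the absolute difference is only 5 points. The 12.3% reported above should be read in this relative sense, with TTS-only adaptation as the baseline.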
