WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
September 12, 2025
Authors: Akshat Pandey, Karun Kumar, Raphael Tang
cs.AI
Abstract
Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
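
To make the idea concrete, below is a minimal PyTorch sketch of a text-to-latent VAE of the kind the abstract describes: a model that maps a transcript to a distribution over the frozen speech encoder's output states. This is an illustration under stated assumptions, not the paper's implementation; the layer sizes, the interpolation-based length adapter between text tokens and encoder frames, and the KL weight are all hypothetical choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToLatentVAE(nn.Module):
    """Sketch of a text-to-latent encoder: a VAE predicting a distribution
    over ASR encoder output states from transcript tokens. Architecture
    details here are illustrative assumptions."""

    def __init__(self, vocab_size: int, d_model: int = 512, n_frames: int = 1500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.to_mu = nn.Linear(d_model, d_model)
        self.to_logvar = nn.Linear(d_model, d_model)
        self.n_frames = n_frames  # length of the encoder-state sequence to mimic

    def forward(self, token_ids: torch.Tensor):
        h = self.backbone(self.embed(token_ids))  # (B, T_text, D)
        # Naive length adapter: interpolate text states up to the encoder's
        # frame count (an assumption; the paper may align lengths differently).
        h = F.interpolate(h.transpose(1, 2), size=self.n_frames,
                          mode="linear", align_corners=False).transpose(1, 2)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample latents that stand in for
        # the speech encoder's outputs.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

def vae_step(vae, token_ids, encoder_states, beta: float = 1e-3):
    """One training step: reconstruct real encoder outputs from text, plus a
    KL term toward a standard normal prior (the standard VAE objective;
    the weighting is a hypothetical choice)."""
    z, mu, logvar = vae(token_ids)
    recon = F.mse_loss(z, encoder_states)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

if __name__ == "__main__":
    vae = TextToLatentVAE(vocab_size=1000, d_model=64, n_frames=100)
    tokens = torch.randint(0, 1000, (2, 17))  # dummy transcripts
    target = torch.randn(2, 100, 64)          # dummy frozen-encoder outputs
    loss = vae_step(vae, tokens, target)
    loss.backward()
    print(f"VAE loss: {loss.item():.4f}")

During decoder fine-tuning, the sampled latents z would take the place of the speech encoder's outputs in the decoder's cross-attention, so the decoder can be adapted from text alone; at inference the original speech encoder is plugged back in, which is why the abstract reports no extra runtime cost.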