WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
September 12, 2025
Authors: Akshat Pandey, Karun Kumar, Raphael Tang
cs.AI
Abstract
Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
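
To make the idea concrete, below is a minimal PyTorch sketch of a text-to-latent VAE of the kind the abstract describes: a model that maps a transcript to a distribution over the frozen speech encoder's output states. This is an illustration under stated assumptions, not the paper's implementation; the layer sizes, the interpolation-based length adapter between text tokens and encoder frames, and the KL weight are all hypothetical choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToLatentVAE(nn.Module):
    """Sketch of a text-to-latent encoder: a VAE predicting a distribution
    over ASR encoder output states from transcript tokens. Architecture
    details here are illustrative assumptions."""

    def __init__(self, vocab_size: int, d_model: int = 512, n_frames: int = 1500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.to_mu = nn.Linear(d_model, d_model)
        self.to_logvar = nn.Linear(d_model, d_model)
        self.n_frames = n_frames  # length of the encoder-state sequence to mimic

    def forward(self, token_ids: torch.Tensor):
        h = self.backbone(self.embed(token_ids))  # (B, T_text, D)
        # Naive length adapter: interpolate text states up to the encoder's
        # frame count (an assumption; the paper may align lengths differently).
        h = F.interpolate(h.transpose(1, 2), size=self.n_frames,
                          mode="linear", align_corners=False).transpose(1, 2)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample latents that stand in for
        # the speech encoder's outputs.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

def vae_step(vae, token_ids, encoder_states, beta: float = 1e-3):
    """One training step: reconstruct real encoder outputs from text, plus a
    KL term toward a standard normal prior (the standard VAE objective;
    the weighting is a hypothetical choice)."""
    z, mu, logvar = vae(token_ids)
    recon = F.mse_loss(z, encoder_states)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

if __name__ == "__main__":
    vae = TextToLatentVAE(vocab_size=1000, d_model=64, n_frames=100)
    tokens = torch.randint(0, 1000, (2, 17))  # dummy transcripts
    target = torch.randn(2, 100, 64)          # dummy frozen-encoder outputs
    loss = vae_step(vae, tokens, target)
    loss.backward()
    print(f"VAE loss: {loss.item():.4f}")

During decoder fine-tuning, the sampled latents z would take the place of the speech encoder's outputs in the decoder's cross-attention, so the decoder can be adapted from text alone; at inference the original speech encoder is plugged back in, which is why the abstract reports no extra runtime cost.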