WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

Abstract

Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE combined with TTS achieves a 12.3% relative reduction in word error rate (WER) compared with TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
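The headline result is a relative WER reduction, which measures the improvement as a fraction of the baseline WER, not as a difference in percentage points. The sketch below is a minimal illustration of that arithmetic. The function name and the two WER values are hypothetical, chosen only so that the result comes out to 12.3%; they are not figures reported in the paper.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the WER improvement as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical WER values in percent (illustrative only, not from the paper).
baseline = 20.0   # assumed WER after TTS-only adaptation
adapted = 17.54   # assumed WER after WhisTLE combined with TTS

reduction = relative_wer_reduction(baseline, adapted)
print(f"Absolute reduction: {baseline - adapted:.2f} points")
print(f"Relative reduction: {reduction:.1%}")
```

With these assumed values, the absolute drop is 2.46 percentage points, while the relative reduction is 12.3% of the baseline.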
