Masked Audio Text Encoders are Effective Multi-Modal Rescorers

Abstract

Masked Language Models (MLMs) have proven effective for second-pass rescoring in Automatic Speech Recognition (ASR) systems. In this work, we propose Masked Audio Text Encoder (MATE), a multi-modal masked language model rescorer that incorporates acoustic representations into the input space of the MLM. We adopt contrastive learning to align the modalities effectively by learning shared representations. We show that a multi-modal rescorer benefits domain generalization of the ASR system when target-domain data is unavailable. MATE reduces word error rate (WER) by 4%-16% on in-domain datasets and 3%-7% on out-of-domain datasets over the text-only baseline. Additionally, with a very limited amount of training data (0.8 hours), MATE achieves a WER reduction of 8%-23% over the first-pass baseline.
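The results above are stated as WER reductions, so a small worked example of the metric may help. The sketch below computes WER as word-level edit distance divided by reference length, and then a relative reduction between two systems. It assumes the reported percentages are relative reductions, which the abstract does not state explicitly. The function names `word_error_rate` and `relative_wer_reduction` are illustrative and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (assumed relative)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts, not data from the paper.
    reference = "the cat sat on the mat"
    first_pass = "the cat sat on mat"
    print(word_error_rate(reference, first_pass))
    print(relative_wer_reduction(0.10, 0.09))
```

Under this reading, a rescorer that lowers WER from 10% to 9% would count as a 10% reduction, not a 1% reduction.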

