Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer

Abstract

Domain adaptation using a text-only corpus is challenging in end-to-end (E2E) speech recognition. Adapting by synthesizing audio from text with text-to-speech (TTS) is resource-intensive. We present a method that learns a Unified Speech-Text Representation in a Conformer Transducer (USTR-CT) to enable fast domain adaptation using only a text corpus. Unlike the previous textogram method, our approach introduces an extra text encoder to learn the text representation and removes it at inference, so online deployment requires no modification. To improve adaptation efficiency, we also explore single-step and multi-step adaptation. In experiments adapting from LibriSpeech to SPGISpeech, the proposed method reduces the word error rate (WER) on the target domain by a relative 44%, outperforming both the TTS method and the textogram method. We also show that the proposed method can be combined with internal language model estimation (ILME) to further improve performance.
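The 44% figure is a relative reduction, meaning it is measured as a fraction of the baseline WER and not in absolute percentage points. The short sketch below shows the arithmetic. The function name and the baseline and adapted WER values are hypothetical, chosen only so that the ratio comes out to 0.44; they are not results reported in the paper.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical values for illustration only (not taken from the paper).
baseline = 25.0  # WER (%) on the target domain before adaptation
adapted = 14.0   # WER (%) on the target domain after adaptation

reduction = relative_wer_reduction(baseline, adapted)
print(f"Relative WER reduction: {reduction:.0%}")
```

With these assumed numbers, the absolute drop is 11 percentage points, while the relative reduction is 11 / 25 = 0.44.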

