Text Injection for Capitalization and Turn-Taking Prediction in Speech Models

Abstract

Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements in word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an end-to-end (E2E) model. In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model that performs two auxiliary tasks. The first is capitalization, which is a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. We show results demonstrating that our text injection method boosts capitalization performance for long-tail data and improves turn-taking detection recall.
