Large-scale Language Model Rescoring on Long-form Data

Abstract

In this work, we study the impact of Large-scale Language Models (LLMs) on Automated Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form ASR. We demonstrate up to 8% relative reduction in Word Error Rate (WER) on US English (en-us) and code-switched Indian English (en-in) long-form ASR test sets, and up to 30% relative reduction in Salient Term Error Rate (STER), over a strong first-pass baseline that uses a maximum-entropy based language model. Two changes result in significant wins in rescoring with LLMs: improved lattice processing, which yields a lattice with a proper (non-tree) digraph topology, and carrying context from the 1-best hypothesis of the previous segment(s). We also find that the performance gains from combining LLMs trained on vast quantities of available data (such as C4) with conventional neural LMs are additive and significantly outperform a strong first-pass baseline with a maximum-entropy LM.
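The headline numbers above are relative reductions in an error rate. As a minimal sketch of what that means, the following computes WER with the standard word-level edit distance (substitutions, deletions, and insertions divided by the number of reference words) and then the relative reduction between a baseline and a rescored system. The function names `word_error_rate` and `relative_reduction` are hypothetical, chosen for illustration; this is not the evaluation code used in the work, and it does not cover STER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, rescored: float) -> float:
    """Fraction of the baseline error rate removed by rescoring."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - rescored) / baseline
```

With made-up values that are not taken from the work, a baseline WER of 0.10 and a rescored WER of 0.092 would correspond to a relative reduction of about 8%, even though the absolute change is only 0.8 points.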
