Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

Abstract

In the era of large models, autoregressive decoding often makes latency a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average relative word error rate (WER) improvement across all languages of 10.8% on FLEURS and 3.6% on YouTube captioning. Furthermore, our comprehensive ablation study analyzes key parameters such as LLM size, context length, vocabulary size, and fusion methodology. For instance, we explore the impact of LLM size, ranging from 128M to 340B parameters, on ASR performance. This study provides valuable insights into the factors influencing the effectiveness of practical large-scale LM-fused speech recognition systems.
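To make the idea of fusing ASR and LM scores in scoring mode more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each audio segment already has a fixed list of candidate transcripts, each with an ASR log-probability and an LM log-probability obtained by scoring the complete text in a single pass (so no token-by-token generation is needed). The names `Hypothesis`, `fuse_and_pick`, and `relative_wer_improvement`, the weight `lm_weight=0.3`, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Hypothesis:
    text: str
    asr_logprob: float  # log-probability assigned by the ASR model (made-up value)
    lm_logprob: float   # log-probability from the LM scoring the whole text at once


def fuse_and_pick(hyps: Sequence[Hypothesis], lm_weight: float = 0.3) -> Hypothesis:
    """Return the hypothesis with the highest fused score.

    fused score = asr_logprob + lm_weight * lm_logprob
    The default weight is an arbitrary placeholder, not a value from the paper.
    """
    if not hyps:
        raise ValueError("need at least one hypothesis")
    return max(hyps, key=lambda h: h.asr_logprob + lm_weight * h.lm_logprob)


def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER improvement in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Illustrative candidates for one segment; the scores are invented.
segment_hyps = [
    Hypothesis("recognize speech", asr_logprob=-4.2, lm_logprob=-6.0),
    Hypothesis("wreck a nice beach", asr_logprob=-4.0, lm_logprob=-14.0),
]
best = fuse_and_pick(segment_hyps)
print(best.text)
```

Because every candidate is scored independently, the LM calls for all hypotheses of a segment can be batched and run in parallel, which is the property a non-autoregressive setup relies on. With `lm_weight=0.0` the sketch falls back to the ASR-only choice, which makes the effect of the LM term easy to inspect.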

