NLE: Non-autoregressive LLM-based ASR by Transcript Editing

Abstract

While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully parallel prediction. NLE extracts acoustic embeddings and an initial hypothesis from a pretrained speech encoder, then refines the hypothesis using a bidirectional LLM editor trained with a latent alignment objective. An interleaved padding strategy exploits the identity mapping bias of Transformers, allowing the model to focus on corrections rather than full reconstruction. On the Open ASR leaderboard, NLE++ achieves 5.67% average WER with an RTFx (inverse real-time factor) of 1630. In single-utterance scenarios, NLE achieves a 27x speedup over the AR baseline, making it suitable for real-time applications.
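
The abstract does not spell out the padding layout or the alignment objective, so the following is only a minimal sketch under stated assumptions: pad slots are placed before, between, and after hypothesis tokens, and the editor output is collapsed with a CTC-style rule (merge adjacent repeats, then drop pads). The names `PAD`, `interleave_padding`, `collapse`, and `rtfx` are hypothetical and chosen for illustration; they are not taken from the paper. The sketch shows why copying the input unchanged (the identity mapping) reproduces the hypothesis, while insertions, substitutions, and deletions each touch a single slot.

```python
from itertools import groupby

PAD = "<pad>"  # hypothetical placeholder symbol, not taken from the paper


def interleave_padding(hypothesis, pad=PAD):
    """Place a pad slot before, between, and after the hypothesis tokens."""
    padded = [pad]
    for token in hypothesis:
        padded.extend([token, pad])
    return padded


def collapse(aligned, pad=PAD):
    """Assumed CTC-style collapse: merge adjacent repeats, then drop pads."""
    merged = [token for token, _ in groupby(aligned)]
    return [token for token in merged if token != pad]


def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


hypothesis = ["the", "cat", "sat"]
editor_input = interleave_padding(hypothesis)

# Identity mapping: an editor that copies its input returns the hypothesis.
unchanged = collapse(editor_input)

# Insertion: the editor fills one pad slot with a new token.
inserted = list(editor_input)
inserted[2] = "black"

# Deletion: the editor overwrites one token slot with a pad.
deleted = list(editor_input)
deleted[3] = PAD

print(unchanged)
print(collapse(inserted))
print(collapse(deleted))
```

Under this reading, an RTFx of 1630 means that 1630 seconds of audio are transcribed per second of processing, i.e. `rtfx(1630.0, 1.0)`.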
