NLE: Non-autoregressive LLM-based ASR by Transcript Editing
March 9, 2026
Authors: Avihu Dekel, Samuel Thomas, Takashi Fukada, George Saon
cs.AI
Abstract
While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully parallel prediction. NLE extracts acoustic embeddings and an initial hypothesis from a pretrained speech encoder, then refines the hypothesis using a bidirectional LLM editor trained with a latent alignment objective. An interleaved padding strategy exploits the identity mapping bias of Transformers, allowing the model to focus on corrections rather than full reconstruction. On the Open ASR leaderboard, NLE++ achieves 5.67% average WER with an RTFx (inverse real-time factor) of 1630. In single-utterance scenarios, NLE achieves 27x speedup over the AR baseline, making it suitable for real-time applications.
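The abstract's interleaved padding strategy can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: the pad token name, the number of pad slots per gap, and the helper functions are all assumptions. The idea is that pad slots interleaved around hypothesis tokens give a bidirectional editor fixed positions for parallel insertions, while unchanged tokens reduce to an identity mapping.

```python
# Hypothetical sketch of an interleaved padding strategy for NAR transcript
# editing. The paper's exact token scheme is not specified here; PAD, the
# pads-per-gap count, and the edit simulation below are illustrative choices.

PAD = "<pad>"

def interleave_pads(hypothesis, pads_per_gap=1):
    """Interleave pad slots before, between, and after hypothesis tokens."""
    out = [PAD] * pads_per_gap
    for tok in hypothesis:
        out.append(tok)
        out.extend([PAD] * pads_per_gap)
    return out

def collapse_pads(edited):
    """Drop pad slots the editor left untouched to produce the final transcript."""
    return [tok for tok in edited if tok != PAD]

hyp = ["the", "cat", "sat"]
padded = interleave_pads(hyp)
# padded == ['<pad>', 'the', '<pad>', 'cat', '<pad>', 'sat', '<pad>']

# Simulate one parallel edit: the editor fills the pad slot between
# "cat" and "sat". Keeping every real token unchanged is exactly the
# identity-mapping behavior the abstract says the model exploits.
edited = list(padded)
edited[4] = "quietly"
print(collapse_pads(edited))  # ['the', 'cat', 'quietly', 'sat']
```

In a real model these slots would be token embeddings refined by the bidirectional editor in one forward pass, rather than list elements edited by hand.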