NLE: Non-autoregressive LLM-based ASR by Transcript Editing
March 9, 2026
Authors: Avihu Dekel, Samuel Thomas, Takashi Fukada, George Saon
cs.AI
Abstract
While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully parallel prediction. NLE extracts acoustic embeddings and an initial hypothesis from a pretrained speech encoder, then refines the hypothesis using a bidirectional LLM editor trained with a latent alignment objective. An interleaved padding strategy exploits the identity mapping bias of Transformers, allowing the model to focus on corrections rather than full reconstruction. On the Open ASR leaderboard, NLE++ achieves 5.67% average WER with an RTFx (inverse real-time factor) of 1630. In single-utterance scenarios, NLE achieves 27x speedup over the AR baseline, making it suitable for real-time applications.
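The interleaved padding mentioned above can be illustrated with a minimal sketch. Note this is an assumption about the general idea, not the paper's exact scheme: pad slots are interleaved with the initial hypothesis so the bidirectional editor has room to insert or delete tokens in parallel, while the identity mapping bias of Transformers lets tokens that need no correction pass through unchanged. The token names and pad count here are hypothetical.

```python
# Hypothetical illustration of interleaved padding for a NAR transcript editor.
# PAD marks empty slots the editor may fill with insertions (or leave blank);
# hypothesis tokens it keeps or replaces in place.
PAD = "<pad>"

def interleave_padding(hypothesis, pads_per_gap=1):
    """Insert `pads_per_gap` pad slots before each hypothesis token and
    after the last one, giving the editor room for insertions."""
    out = []
    for tok in hypothesis:
        out.extend([PAD] * pads_per_gap)
        out.append(tok)
    out.extend([PAD] * pads_per_gap)
    return out

hyp = ["the", "cat", "sat"]
print(interleave_padding(hyp))
# → ['<pad>', 'the', '<pad>', 'cat', '<pad>', 'sat', '<pad>']
```

During training, a latent alignment objective (per the abstract) would score all valid alignments between the editor's per-slot outputs and the reference transcript, so the model is not forced into one fixed correspondence between slots and target tokens.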