NLE: Non-autoregressive LLM-based ASR by Transcript Editing

Abstract

While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully parallel prediction. NLE extracts acoustic embeddings and an initial hypothesis from a pretrained speech encoder, then refines the hypothesis using a bidirectional LLM editor trained with a latent alignment objective. An interleaved padding strategy exploits the identity mapping bias of Transformers, allowing the model to focus on corrections rather than full reconstruction. On the Open ASR leaderboard, NLE++ achieves 5.67% average WER with an RTFx (inverse real-time factor) of 1630. In single-utterance scenarios, NLE achieves a 27x speedup over the AR baseline, making it suitable for real-time applications.
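To make the editing formulation concrete, the following is a minimal sketch of how interleaved padding could interact with a CTC-style latent alignment. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the padding layout or the alignment objective, so the one-slot-per-token layout, the `<pad>` and `<blank>` symbols, and the helper names `interleave_padding`, `identity_edit`, and `ctc_collapse` are all hypothetical. The point it shows is that copying the input (identity mapping) already reproduces the initial hypothesis, so only a few positions need to change to apply a correction.

```python
from typing import List

PAD = "<pad>"      # hypothetical padding symbol placed between hypothesis tokens
BLANK = "<blank>"  # hypothetical blank symbol of a CTC-style alignment


def interleave_padding(hypothesis: List[str], pads_per_token: int = 1) -> List[str]:
    """Place pad slots before the first token and after every token.

    Assumed layout: the free slots give a parallel editor room to insert tokens.
    """
    padded = [PAD] * pads_per_token
    for token in hypothesis:
        padded.append(token)
        padded.extend([PAD] * pads_per_token)
    return padded


def identity_edit(padded: List[str]) -> List[str]:
    """Stand-in for an editor that changes nothing: tokens are copied, pads become blanks."""
    return [BLANK if symbol == PAD else symbol for symbol in padded]


def ctc_collapse(frames: List[str]) -> List[str]:
    """Standard CTC collapse: merge adjacent repeats, then drop blanks."""
    output: List[str] = []
    previous = None
    for symbol in frames:
        if symbol != previous and symbol != BLANK:
            output.append(symbol)
        previous = symbol
    return output


hypothesis = ["i", "read", "book"]          # initial hypothesis with a missing word
padded = interleave_padding(hypothesis)     # 7 positions: pad, i, pad, read, pad, book, pad

# Identity mapping: the collapsed output equals the initial hypothesis.
unchanged = ctc_collapse(identity_edit(padded))

# A single-position edit: the slot between "read" and "book" emits "the".
edited_frames = identity_edit(padded)
edited_frames[4] = "the"
corrected = ctc_collapse(edited_frames)

print(unchanged)  # ['i', 'read', 'book']
print(corrected)  # ['i', 'read', 'the', 'book']
```

In this sketch, substitutions overwrite a token position, deletions emit a blank at a token position, and insertions fill a pad slot, all of which can be predicted for every position in parallel.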
