Drax: Speech Recognition with Discrete Flow Matching

Abstract

Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling; however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct random-noise-to-target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.
