Drax: Speech Recognition with Discrete Flow Matching

Abstract

Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling; however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct random-noise-to-target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.
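
To make the idea of an audio-conditioned probability path more concrete, the following is a minimal sketch rather than the method as implemented in the paper. It assumes a per-token three-way mixture between a source state, an audio-conditioned intermediate guess, and the target transcript. The quadratic weight schedule, the function names `path_weights` and `sample_path_state`, and the toy token sequences are all hypothetical and chosen only for illustration.

```python
import random


def path_weights(t):
    """Assumed schedule: mixture weights for (source, intermediate, target).

    The weights sum to 1 for any t in [0, 1]; the intermediate weight
    peaks at t = 0.5 and vanishes at both endpoints.
    """
    return (1.0 - t) ** 2, 2.0 * t * (1.0 - t), t ** 2


def sample_path_state(source, intermediate, target, t, rng):
    """Draw a noisy sequence x_t, token by token, from the mixture at time t."""
    w_src, w_mid, _ = path_weights(t)
    state = []
    for s, m, y in zip(source, intermediate, target):
        u = rng.random()
        if u < w_src:
            state.append(s)   # still at the source (e.g. a mask token)
        elif u < w_src + w_mid:
            state.append(m)   # plausible, possibly wrong, acoustic guess
        else:
            state.append(y)   # ground-truth token
    return state


rng = random.Random(0)
target = ["the", "cat", "sat", "down"]
source = ["<mask>"] * len(target)
# Hypothetical audio-conditioned guess containing an acoustic confusion.
intermediate = ["the", "cap", "sat", "down"]

for t in (0.0, 0.5, 1.0):
    print(t, sample_path_state(source, intermediate, target, t, rng))
```

Under these assumptions, states drawn at intermediate times mix masked tokens, acoustically plausible mistakes, and correct tokens, which is the kind of input a parallel decoder would face partway through inference. At t = 0 the sample is the pure source sequence and at t = 1 it is the target.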
