Drax: Speech Recognition with Discrete Flow Matching

Abstract

Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling, but their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct transitions from random noise to the target. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, which are controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.
