Drax: Speech Recognition with Discrete Flow Matching
October 5, 2025
Authors: Aviv Navon, Aviv Shamsian, Neta Glazer, Yael Segal-Feldman, Gill Hetz, Joseph Keshet, Ethan Fetaya
cs.AI
Abstract
Diffusion and flow-based non-autoregressive (NAR) models have shown strong
promise in large language modeling; however, their potential for automatic
speech recognition (ASR) remains largely unexplored. We propose Drax, a
discrete flow matching framework for ASR that enables efficient parallel
decoding. To better align training with inference, we construct an
audio-conditioned probability path that guides the model through trajectories
resembling likely intermediate inference errors, rather than direct
random-noise-to-target transitions. Our theoretical analysis links the generalization
gap to divergences between training and inference occupancies, controlled by
cumulative velocity errors, thereby motivating our design choice. Empirical
evaluation demonstrates that our approach attains recognition accuracy on par
with state-of-the-art speech models while offering improved accuracy-efficiency
trade-offs, highlighting discrete flow matching as a promising direction for
advancing NAR ASR.
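
The abstract does not spell out the form of the audio-conditioned probability path. As a rough illustration only, the sketch below assumes a three-way token-level mixture: at time t, each position is drawn from a noise source, from a hypothetical audio-conditioned "middle" distribution, or from the target transcript. The quadratic schedule, the function names, and the `mid_probs` stand-in (here a hand-made distribution rather than an acoustic model's output) are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def mixture_weights(t):
    """Assumed schedule: weights for (source, middle, target) at time t in [0, 1]."""
    w_src = (1.0 - t) ** 2
    w_tgt = t ** 2
    w_mid = 1.0 - w_src - w_tgt  # equals 2 * t * (1 - t), largest at t = 0.5
    return np.array([w_src, w_mid, w_tgt])

def sample_xt(target, mid_probs, t, vocab_size, rng):
    """Sample a partially noised token sequence x_t, one choice per position.

    target:    (L,) integer array of reference token ids
    mid_probs: (L, V) per-position distribution; a hypothetical stand-in for an
               audio-conditioned distribution over plausible (possibly wrong) tokens
    """
    length = len(target)
    # Source tokens: uniform noise over the vocabulary (an assumption).
    source = rng.integers(0, vocab_size, size=length)
    # Middle tokens: drawn from the audio-conditioned stand-in.
    middle = np.array([rng.choice(vocab_size, p=p) for p in mid_probs])
    # Per position, pick which of the three components supplies the token.
    choice = rng.choice(3, size=length, p=mixture_weights(t))
    return np.where(choice == 0, source, np.where(choice == 1, middle, target))

# Toy usage with a made-up 5-token vocabulary and a 4-token transcript.
rng = np.random.default_rng(0)
vocab_size = 5
target = np.array([1, 3, 2, 4])
mid_probs = np.full((4, vocab_size), 0.05)
mid_probs[np.arange(4), target] = 0.8  # mostly right, sometimes confusable
x_half = sample_xt(target, mid_probs, 0.5, vocab_size, rng)
x_end = sample_xt(target, mid_probs, 1.0, vocab_size, rng)
```

Under this assumed schedule, sequences at intermediate times are dominated by near-miss tokens rather than pure noise, which is the qualitative behavior the abstract describes; at t = 1 the sample coincides with the target.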