Drax: Speech Recognition with Discrete Flow Matching

Abstract

Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling; however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct random-noise-to-target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, which are controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.

