Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
September 20, 2025
Authors: Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland
cs.AI
Abstract
Diffusion-based large language models (DLLMs) have recently attracted growing
interest as an alternative to autoregressive decoders. In this work, we present
an empirical study on using the diffusion-based large language model LLaDA for
automatic speech recognition (ASR). We first investigate its use as an external
deliberation-based processing module for Whisper-LLaMA transcripts. By
leveraging the bidirectional attention and denoising capabilities of LLaDA, we
explore random masking, low-confidence masking, and semi-autoregressive
strategies, showing that Whisper-LLaDA substantially reduces word error rate (WER) compared with
the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER
on test-clean/test-other, representing a 12.3% relative improvement over the
Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA
without acoustic features fails to improve accuracy, highlighting the
importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA
as a standalone decoder for ASR with diffusion-based and semi-autoregressive
decoding. Most experimental configurations achieve faster inference than the
Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These
findings offer an empirical view of diffusion-based LLMs for ASR and point to
promising directions for improvements.
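
The low-confidence masking strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the procedure's details, so everything below is an assumption made for illustration: the names `low_confidence_mask`, `deliberate`, and `MASK`, the `ratio` and `steps` parameters, and the `denoise` callback. The callback stands in for an audio-conditioned diffusion model that fills masked positions and returns new per-token confidences. This is not the authors' implementation.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical mask symbol; a real system would use the tokenizer's mask id.
MASK = "<mask>"

# Assumed interface: takes a partially masked token list and returns
# (filled tokens, per-token confidences). In an audio-conditioned system
# this call would also receive acoustic embeddings.
Denoiser = Callable[[List[str]], Tuple[List[str], List[float]]]


def low_confidence_mask(
    tokens: Sequence[str], confidences: Sequence[float], ratio: float
) -> List[str]:
    """Replace the lowest-confidence fraction of tokens with MASK."""
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    k = math.ceil(ratio * len(tokens))
    # Sort positions by ascending confidence; ties are broken by position.
    order = sorted(range(len(tokens)), key=lambda i: (confidences[i], i))
    to_mask = set(order[:k])
    return [MASK if i in to_mask else t for i, t in enumerate(tokens)]


def deliberate(
    tokens: Sequence[str],
    confidences: Sequence[float],
    denoise: Denoiser,
    ratio: float = 0.25,
    steps: int = 1,
) -> List[str]:
    """Re-mask the least confident tokens and let the denoiser refill them."""
    current, conf = list(tokens), list(confidences)
    for _ in range(steps):
        masked = low_confidence_mask(current, conf, ratio)
        current, conf = denoise(masked)
    return current
```

In this sketch the first-pass hypothesis and its confidences would come from the baseline recognizer, and each deliberation step revisits only the tokens the model is least sure about, leaving the rest as fixed context for the bidirectional denoiser.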