Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

Abstract

Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We first investigate its use as an external deliberation-based processing module for Whisper-LLaMA transcripts. By leveraging the bidirectional attention and denoising capabilities of LLaDA, we explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces word error rate (WER) compared with the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER on test-clean/test-other, a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA without acoustic features fails to improve accuracy, highlighting the importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA as a standalone decoder for ASR with diffusion-based and semi-autoregressive decoding. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These findings offer an empirical view of diffusion-based LLMs for ASR and point to promising directions for improvement.
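
As a rough illustration of the two masking strategies named in the abstract, the sketch below picks which positions of a first-pass transcript would be handed back to a denoising model for re-prediction. It is only a toy under stated assumptions: the mask symbol, the function names, the per-token confidence scores, and the masking ratio are all hypothetical and are not taken from the paper, and the actual re-prediction by LLaDA with audio conditioning is not modeled here.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask symbol, not the model's real mask token


def random_mask(tokens, ratio, seed=0):
    """Mask a randomly chosen fraction of positions (seeded for repeatability)."""
    rng = random.Random(seed)
    k = math.ceil(ratio * len(tokens))
    picked = set(rng.sample(range(len(tokens)), k))
    return [MASK if i in picked else t for i, t in enumerate(tokens)]


def low_confidence_mask(tokens, confidences, ratio):
    """Mask the fraction of positions whose confidence scores are lowest."""
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    k = math.ceil(ratio * len(tokens))
    # Sort positions by ascending confidence and keep the k least confident.
    order = sorted(range(len(tokens)), key=lambda i: confidences[i])
    picked = set(order[:k])
    return [MASK if i in picked else t for i, t in enumerate(tokens)]


# Made-up first-pass hypothesis with made-up per-token confidences.
hypothesis = ["the", "cat", "sad", "on", "the", "mat"]
confidences = [0.99, 0.95, 0.41, 0.90, 0.98, 0.62]

masked = low_confidence_mask(hypothesis, confidences, ratio=0.3)
print(masked)
```

In a deliberation setup of this kind, the masked sequence would then be passed, together with the audio-conditioned embeddings, to the diffusion model so that it can fill the masked positions using context from both sides.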
