
Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

September 20, 2025
Authors: Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland
cs.AI

Abstract

Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We first investigate its use as an external deliberation-based processing module for Whisper-LLaMA transcripts. By leveraging the bidirectional attention and denoising capabilities of LLaDA, we explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces the word error rate (WER) compared with the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER on test-clean/test-other, a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA without acoustic features fails to improve accuracy, highlighting the importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA as a standalone decoder for ASR with diffusion-based and semi-autoregressive decoding. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These findings offer an empirical view of diffusion-based LLMs for ASR and point to promising directions for improvement.
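To make the low-confidence masking strategy concrete, the following is a minimal sketch of LLaDA-style iterative denoising with a toy stand-in for the actual model. All names here (`toy_denoiser`, `MASK_ID`, the linear unmasking schedule) are illustrative assumptions, not the authors' released code or API; a real Whisper-LLaDA step would condition the logits on audio-derived embeddings, which is exactly what the plain-text ablation in the abstract lacks.

```python
# Minimal sketch of low-confidence remasking for a diffusion LLM decoder.
# toy_denoiser, MASK_ID, and the schedule are hypothetical placeholders.
import torch

MASK_ID = 0    # hypothetical id of the mask token
VOCAB = 1000   # hypothetical vocabulary size


def toy_denoiser(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for a bidirectional denoiser: per-position logits over the
    vocabulary. A real Whisper-LLaDA step would condition these logits on
    audio features rather than returning random values."""
    return torch.randn(tokens.shape[0], VOCAB)


def decode_low_confidence(length: int, steps: int = 8) -> torch.Tensor:
    """Start from a fully masked sequence; at each step commit the most
    confident predictions and re-mask the rest, shrinking the masked set
    linearly to zero over `steps` iterations."""
    tokens = torch.full((length,), MASK_ID, dtype=torch.long)
    masked = torch.ones(length, dtype=torch.bool)
    for step in range(steps, 0, -1):
        logits = toy_denoiser(tokens)
        logits[:, MASK_ID] = float("-inf")   # never predict the mask token
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf[~masked] = float("inf")         # committed positions stay fixed
        tokens[masked] = pred[masked]        # tentatively fill every gap
        n_remask = masked.sum().item() * (step - 1) // step
        remask = conf.argsort()[:n_remask]   # least confident positions
        tokens[remask] = MASK_ID
        masked = tokens == MASK_ID
    return tokens


if __name__ == "__main__":
    print(decode_low_confidence(length=12, steps=4))
```

Under the same skeleton, the random-masking variant would choose `remask` uniformly at random instead of by confidence, and the semi-autoregressive variant would run the refinement block-by-block from left to right; for the deliberation setting, `tokens` would be initialized from a partially masked Whisper-LLaMA hypothesis rather than from an all-mask sequence.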