DeCRED: Decoder-Centric Regularization for Encoder-Decoder Based Speech Recognition
August 12, 2025
Authors: Alexander Polok, Santosh Kesiraju, Karel Beneš, Bolaji Yusuf, Lukáš Burget, Jan Černocký
cs.AI
Abstract
This paper presents a simple yet effective regularization for the internal
language model induced by the decoder in encoder-decoder ASR models, thereby
improving robustness and generalization in both in- and out-of-domain settings.
The proposed method, Decoder-Centric Regularization in Encoder-Decoder
(DeCRED), adds auxiliary classifiers to the decoder, enabling next token
prediction via intermediate logits. Empirically, DeCRED reduces the internal
LM's mean BPE perplexity across 11 test sets by 36.6% relative. Furthermore, this
translates into actual WER improvements over the baseline in 5 of 7 in-domain
and 3 of 4 out-of-domain test sets, reducing macro WER from 6.4% to 6.3% and
18.2% to 16.2%, respectively. On TEDLIUM3, DeCRED achieves 7.0% WER, surpassing
the baseline and encoder-centric InterCTC regularization by 0.6% and 0.5%,
respectively. Finally, we compare DeCRED with OWSM v3.1 and Whisper-medium,
showing competitive WERs despite training on much less data and using fewer
parameters.
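
The core idea stated above (auxiliary classifiers attached to intermediate decoder layers, trained for next-token prediction alongside the final classifier) can be sketched as follows. This is a minimal illustrative PyTorch implementation, not the authors' code: the auxiliary layer indices, the loss weight, and the choice to share one output projection between the final and auxiliary heads are assumptions made for the example.

```python
# Minimal sketch of decoder-centric regularization in the spirit of DeCRED.
# Assumption: a Transformer decoder stack with auxiliary next-token classifiers
# after selected intermediate layers; total loss = final CE + weighted mean of
# the auxiliary CE losses. Layer indices, weight, and the shared output
# projection are illustrative choices, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeCREDStyleDecoder(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, nhead=4,
                 num_layers=6, aux_layers=(2, 4), aux_weight=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        # Shared output projection for the final and auxiliary classifiers
        # (separate heads per auxiliary layer would be an equally valid sketch).
        self.out_proj = nn.Linear(d_model, vocab_size)
        self.aux_layers = set(aux_layers)
        self.aux_weight = aux_weight

    def forward(self, tokens, memory, targets):
        # tokens:  (B, T) previous token ids
        # memory:  (B, S, d_model) encoder output
        # targets: (B, T) next-token labels
        T = tokens.size(1)
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        x = self.embed(tokens)
        aux_losses = []
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x, memory, tgt_mask=causal)
            if i in self.aux_layers:
                # Intermediate logits -> auxiliary next-token prediction loss,
                # regularizing the decoder's internal language model.
                aux_logits = self.out_proj(x)
                aux_losses.append(
                    F.cross_entropy(aux_logits.transpose(1, 2), targets))
        main_loss = F.cross_entropy(self.out_proj(x).transpose(1, 2), targets)
        aux_loss = torch.stack(aux_losses).mean() if aux_losses else 0.0
        return main_loss + self.aux_weight * aux_loss


# Toy usage with random encoder output and token ids.
dec = DeCREDStyleDecoder()
tokens = torch.randint(0, 1000, (2, 12))
memory = torch.randn(2, 50, 256)
loss = dec(tokens, memory, targets=torch.randint(0, 1000, (2, 12)))
loss.backward()
```

In this sketch the auxiliary heads act purely as a training-time regularizer; at inference only the final classifier's logits would be used for decoding.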