

DeCRED: Decoder-Centric Regularization for Encoder-Decoder Based Speech Recognition

August 12, 2025
Authors: Alexander Polok, Santosh Kesiraju, Karel Beneš, Bolaji Yusuf, Lukáš Burget, Jan Černocký
cs.AI

Abstract

This paper presents a simple yet effective regularization for the internal language model induced by the decoder in encoder-decoder ASR models, improving robustness and generalization in both in-domain and out-of-domain settings. The proposed method, Decoder-Centric Regularization in Encoder-Decoder (DeCRED), adds auxiliary classifiers to the decoder, enabling next-token prediction from intermediate logits. Empirically, DeCRED reduces the mean internal LM BPE perplexity by a relative 36.6% across 11 test sets. This translates into actual WER improvements over the baseline on 5 of 7 in-domain and 3 of 4 out-of-domain test sets, reducing macro-averaged WER from 6.4% to 6.3% and from 18.2% to 16.2%, respectively. On TEDLIUM3, DeCRED achieves 7.0% WER, surpassing the baseline and the encoder-centric InterCTC regularization by 0.6% and 0.5%, respectively. Finally, we compare DeCRED with OWSM v3.1 and Whisper-medium, showing competitive WERs despite training on much less data with fewer parameters.
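No implementation appears on this page; the sketch below only illustrates the mechanism the abstract describes, under stated assumptions: auxiliary classifiers attached to intermediate decoder layers predict the next token from intermediate logits, and their cross-entropy losses are added to the usual training objective. All names (DecoderWithAuxHeads, decred_style_loss), the choice of tapped layers, and the aux_weight value are hypothetical, not taken from the paper.

```python
# Minimal PyTorch sketch of decoder-centric regularization (DeCRED-style).
# Everything here is illustrative, not the paper's actual implementation.
import torch.nn as nn
import torch.nn.functional as F

class DecoderWithAuxHeads(nn.Module):
    """Transformer decoder that taps intermediate layers with extra
    next-token classifiers."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=6,
                 aux_layers=(2, 4)):  # which layers to tap: an assumed choice
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.aux_layers = set(aux_layers)
        # One auxiliary classifier per tapped layer, plus the usual final head.
        self.aux_heads = nn.ModuleDict(
            {str(i): nn.Linear(d_model, vocab_size) for i in aux_layers})
        self.final_head = nn.Linear(d_model, vocab_size)

    def forward(self, tgt_tokens, encoder_out, tgt_mask=None):
        x = self.embed(tgt_tokens)
        aux_logits = []
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x, encoder_out, tgt_mask=tgt_mask)
            if i in self.aux_layers:
                aux_logits.append(self.aux_heads[str(i)](x))  # intermediate logits
        return self.final_head(x), aux_logits


def decred_style_loss(final_logits, aux_logits, targets, aux_weight=0.3, pad_id=0):
    """Next-token cross-entropy on the final head plus a weighted cross-entropy
    on each auxiliary head; aux_weight is an assumed trade-off value."""
    ce = lambda logits: F.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=pad_id)  # (B,T,V) -> (B,V,T)
    loss = ce(final_logits)
    for logits in aux_logits:
        loss = loss + aux_weight * ce(logits)
    return loss
```

In this reading, the auxiliary heads act purely as a training-time regularizer on the decoder's internal language model; how DeCRED handles the intermediate logits at decoding time is not specified in the abstract.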