DeCRED: Decoder-Centric Regularization for Encoder-Decoder Based Speech Recognition

Abstract

This paper presents a simple yet effective regularization of the internal language model (LM) induced by the decoder in encoder-decoder ASR models, improving robustness and generalization in both in-domain and out-of-domain settings. The proposed method, Decoder-Centric Regularization in Encoder-Decoder (DeCRED), adds auxiliary classifiers to the decoder, enabling next-token prediction from intermediate logits. Empirically, DeCRED reduces the mean internal LM BPE perplexity by 36.6% relative across 11 test sets. This translates into WER improvements over the baseline on 5 of 7 in-domain and 3 of 4 out-of-domain test sets, reducing macro WER from 6.4% to 6.3% and from 18.2% to 16.2%, respectively. On TEDLIUM3, DeCRED achieves 7.0% WER, surpassing the baseline and the encoder-centric InterCTC regularization by 0.6% and 0.5% absolute, respectively. Finally, we compare DeCRED with OWSM v3.1 and Whisper-medium, showing competitive WERs despite training on much less data with fewer parameters.
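
To make the idea of auxiliary classifiers on intermediate decoder layers concrete, the following is a minimal sketch in plain Python. It assumes that each tapped decoder layer produces its own next-token logits and that the training objective is a weighted sum of the per-layer cross-entropies. The function names, the choice of tapped layers, and the weights are hypothetical illustrations, not the configuration used in the paper.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def decred_style_loss(layer_logits, target, weights):
    """Weighted sum of next-token cross-entropies, one per tapped decoder layer.

    layer_logits: list of logit vectors; by convention here the last entry
                  comes from the final decoder layer.
    weights:      hypothetical per-layer weights (assumed to sum to 1).
    """
    if len(layer_logits) != len(weights):
        raise ValueError("need exactly one weight per tapped layer")
    return sum(w * cross_entropy(l, target) for w, l in zip(weights, layer_logits))


# Toy example: a 3-token vocabulary, one intermediate layer and the final layer.
intermediate = [0.2, 1.0, -0.5]
final = [0.1, 2.5, -1.0]
loss = decred_style_loss([intermediate, final], target=1, weights=[0.4, 0.6])

# Perplexity of a single prediction is exp(cross-entropy).
final_ppl = math.exp(cross_entropy(final, target=1))
```

In this toy setting the intermediate layer is less confident about the correct token than the final layer, so its cross-entropy term is larger. Weighting that term into the loss is what pushes the earlier decoder layers to predict the next token on their own.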
