
Masked Audio Text Encoders are Effective Multi-Modal Rescorers

May 11, 2023
作者: Jinglun Cai, Monica Sunkara, Xilai Li, Anshu Bhatia, Xiao Pan, Sravan Bodapati
cs.AI

Abstract

Masked Language Models (MLMs) have proven to be effective for second-pass rescoring in Automatic Speech Recognition (ASR) systems. In this work, we propose the Masked Audio Text Encoder (MATE), a multi-modal masked language model rescorer that incorporates acoustic representations into the input space of the MLM. We adopt contrastive learning to effectively align the modalities by learning shared representations. We show that using a multi-modal rescorer is beneficial for domain generalization of the ASR system when target-domain data is unavailable. MATE reduces word error rate (WER) by 4%-16% on in-domain datasets and by 3%-7% on out-of-domain datasets, over the text-only baseline. Additionally, with a very limited amount of training data (0.8 hours), MATE achieves a WER reduction of 8%-23% over the first-pass baseline.
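
The abstract describes two ingredients: projecting acoustic representations into the MLM's input space, and aligning the audio and text modalities with a contrastive objective over shared representations. The sketch below is a minimal, illustrative PyTorch rendering of that general recipe only; the class names, dimensions, mean-pooling, and InfoNCE-style loss are assumptions made for illustration and do not reproduce the paper's actual MATE architecture or training setup.

```python
# Illustrative sketch (not the paper's implementation): acoustic frames are projected
# into the text encoder's input space, encoded jointly with masked hypothesis tokens,
# and a contrastive loss aligns pooled audio and text representations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalMLMRescorer(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_heads=4, n_layers=2, audio_dim=80):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Project acoustic features (e.g., log-mel frames) into the MLM input space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, audio_feats):
        text_in = self.token_emb(token_ids)        # (B, T_text, D)
        audio_in = self.audio_proj(audio_feats)    # (B, T_audio, D)
        # Concatenate audio and text along the sequence axis and encode them jointly.
        joint = self.encoder(torch.cat([audio_in, text_in], dim=1))
        audio_out = joint[:, : audio_feats.size(1)]
        text_out = joint[:, audio_feats.size(1):]
        logits = self.mlm_head(text_out)           # MLM logits over (masked) tokens
        # Mean-pool each modality to obtain representations for contrastive alignment.
        return logits, audio_out.mean(dim=1), text_out.mean(dim=1)


def contrastive_alignment_loss(audio_repr, text_repr, temperature=0.07):
    """Symmetric InfoNCE: matched audio/text pairs within a batch are positives."""
    a = F.normalize(audio_repr, dim=-1)
    t = F.normalize(text_repr, dim=-1)
    logits = a @ t.T / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


# Toy usage: score a hypothesis jointly with its audio. In rescoring, a lower
# masked-LM loss for a hypothesis would indicate a better second-pass score;
# the contrastive term is a training-time alignment objective.
model = MultiModalMLMRescorer()
tokens = torch.randint(0, 1000, (2, 12))   # hypothesis token ids (with masks during training)
audio = torch.randn(2, 50, 80)             # acoustic features, e.g., 50 frames of 80-dim log-mel
logits, a_repr, t_repr = model(tokens, audio)
mlm_loss = F.cross_entropy(logits.reshape(-1, 1000), tokens.reshape(-1))
loss = mlm_loss + contrastive_alignment_loss(a_repr, t_repr)
print(loss.item())
```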