Masked Audio Text Encoders are Effective Multi-Modal Rescorers
May 11, 2023
作者: Jinglun Cai, Monica Sunkara, Xilai Li, Anshu Bhatia, Xiao Pan, Sravan Bodapati
cs.AI
Abstract
Masked Language Models (MLMs) have proven to be effective for second-pass
rescoring in Automatic Speech Recognition (ASR) systems. In this work, we
propose Masked Audio Text Encoder (MATE), a multi-modal masked language model
rescorer which incorporates acoustic representations into the input space of
MLM. We adopt contrastive learning for effectively aligning the modalities by
learning shared representations. We show that using a multi-modal rescorer is
beneficial for domain generalization of the ASR system when target domain data
is unavailable. MATE reduces word error rate (WER) by 4%-16% on in-domain
datasets and by 3%-7% on out-of-domain datasets over the text-only baseline.
Additionally, with a very limited amount of training data (0.8 hours), MATE
achieves a WER reduction of 8%-23% over the first-pass baseline.
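The modality alignment described above can be sketched as a batch-wise contrastive (InfoNCE-style) objective: each matched audio/text embedding pair is treated as a positive, and all other pairings in the batch act as negatives. This is a minimal illustration of the general technique, not the paper's exact loss; the embedding vectors, temperature value, and symmetric two-direction formulation here are assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_alignment_loss(audio_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    audio_embs[i] and text_embs[i] are a positive pair; every other
    cross-modal pairing in the batch serves as a negative.
    """
    n = len(audio_embs)
    # Temperature-scaled similarity matrix: rows are audio, columns are text.
    sims = [[cosine(a, t) / temperature for t in text_embs]
            for a in audio_embs]
    loss = 0.0
    for i in range(n):
        # Audio-to-text direction: cross-entropy of row i against index i.
        row = sims[i]
        loss += -row[i] + math.log(sum(math.exp(s) for s in row))
        # Text-to-audio direction: cross-entropy of column i against index i.
        col = [sims[j][i] for j in range(n)]
        loss += -col[i] + math.log(sum(math.exp(s) for s in col))
    return loss / (2 * n)
```

Correctly aligned pairs should yield a much lower loss than shuffled ones, which is what drives the two encoders toward a shared representation space.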