

Improving Joint Speech-Text Representations Without Alignment

August 11, 2023
Authors: Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho
cs.AI

Abstract

The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these methods show promise, they have required special treatment of the sequence-length mismatch inherent in speech and text, either by up-sampling heuristics or an explicit alignment model. In this work, we offer evidence that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length, and argue that consistency losses could forgive length differences and simply assume the best alignment. We show that such a loss improves downstream WER in both large-parameter monolingual and multilingual systems.