Toward Joint Language Modeling for Speech Units and Text

October 12, 2023
Authors: Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, Michael Auli
cs.AI

Abstract

Speech and text are two major forms of human language. The research community has been focusing on mapping speech to text or vice versa for many years. However, in the field of language modeling, very little effort has been made to model them jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers to transform continuous speech signals into discrete units and use different methods to construct mixed speech-text data. We introduce automatic metrics to evaluate how well the joint LM mixes speech and text. We also fine-tune the LM on downstream spoken language understanding (SLU) tasks with different modalities (speech or text) and test its performance to assess the model's learning of shared representations. Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks and shows zero-shot cross-modal transferability.
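
As a concrete illustration of the pipeline the abstract describes, here is a minimal sketch of one common speech-tokenization recipe (k-means clustering over frame-level features from a self-supervised encoder, with consecutive repeats collapsed into discrete units) together with one plausible word-level span-swapping scheme for building mixed speech-text data. The paper compares several tokenizers and mixing methods; the function names, cluster count, swap probability, and the assumption of given word-to-unit alignments below are all illustrative, not the authors' exact setup.

```python
from __future__ import annotations

import numpy as np
from sklearn.cluster import KMeans


def train_tokenizer(features: np.ndarray, n_units: int = 100) -> KMeans:
    """Fit k-means on frame-level speech features (e.g., from a
    self-supervised encoder) to define a discrete unit inventory."""
    return KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(features)


def speech_to_units(tokenizer: KMeans, features: np.ndarray) -> list[str]:
    """Assign each frame to its nearest cluster and collapse consecutive
    repeats, yielding a sequence of discrete speech-unit tokens."""
    units: list[str] = []
    prev = None
    for cid in tokenizer.predict(features):
        if cid != prev:
            units.append(f"<unit_{cid}>")
        prev = cid
    return units


def mix_speech_text(
    words: list[str],
    unit_spans: list[list[str]],
    swap_prob: float = 0.3,
    rng: np.random.Generator | None = None,
) -> list[str]:
    """Build one mixed training sequence: each word is replaced by the
    speech units aligned to it with probability `swap_prob` (word-to-unit
    alignment is assumed to be given, e.g., from a forced aligner)."""
    rng = rng or np.random.default_rng()
    mixed: list[str] = []
    for word, units in zip(words, unit_spans):
        if rng.random() < swap_prob:
            mixed.extend(units)  # speech-unit side of the mixed sequence
        else:
            mixed.append(word)   # text side of the mixed sequence
    return mixed
```

Collapsing repeated cluster IDs is the usual convention in unit language modeling; the word-to-unit alignments assumed by `mix_speech_text` would in practice come from a forced aligner or a comparable alignment step.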