Toward Joint Language Modeling for Speech Units and Text
October 12, 2023
Authors: Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, Michael Auli
cs.AI
Abstract
Speech and text are two major forms of human language. For many years, the research community has focused on mapping speech to text, or vice versa. In the field of language modeling, however, little effort has been made to model the two jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers that transform continuous speech signals into discrete units, and we use different methods to construct mixed speech-text data. We introduce automatic metrics to evaluate how well the joint LM mixes speech and text. We also fine-tune the LM on downstream spoken language understanding (SLU) tasks in different modalities (speech or text) and test its performance to assess how well the model learns shared representations. Our results show that, by mixing speech units and text with our proposed mixing techniques, the joint LM outperforms a speech-only baseline on SLU tasks and exhibits zero-shot cross-modal transferability.
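
To make the two ingredients in the abstract concrete, below is a minimal, self-contained sketch (not the paper's implementation) of (1) a speech tokenizer that discretizes continuous frame features into unit IDs via k-means, in the spirit of HuBERT-style tokenizers, and (2) constructing a mixed speech-text sequence by alternating word spans between modalities. The feature values, unit vocabulary size, word-to-span pairing, and alternation scheme are all illustrative assumptions; real systems derive spans from word-level alignments.

```python
# Hypothetical sketch of speech-unit tokenization and speech-text mixing.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# (1) Speech tokenizer: quantize frame-level features into discrete units.
# Random features stand in for a pretrained encoder's outputs.
frames = rng.normal(size=(200, 16))
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(frames)
units = kmeans.predict(frames)  # one unit ID per frame

# Collapse consecutive duplicate units, a common step for unit LMs.
dedup = [int(units[0])] + [int(u) for prev, u in zip(units, units[1:]) if u != prev]
speech_tokens = [f"<unit_{u}>" for u in dedup]

# (2) Mixed speech-text data: alternate word spans between modalities.
# Each word is paired with a made-up span of unit tokens purely for
# illustration; a real pipeline would use forced alignments.
words = ["speech", "and", "text", "are", "two", "forms"]
spans = np.array_split(speech_tokens, len(words))

mixed = []
for i, (word, span) in enumerate(zip(words, spans)):
    # Swap every other word into its speech-unit span.
    mixed.extend(list(span) if i % 2 == 0 else [word])

print(" ".join(mixed))
```

The resulting token stream interleaves unit tokens and text tokens in one vocabulary, which is the kind of mixed speech-text training data the abstract refers to; the LM is then trained on such sequences alongside unimodal speech and text data.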