Toward Joint Language Modeling for Speech Units and Text

Abstract

Speech and text are two major forms of human language. For many years, the research community has focused on mapping speech to text or vice versa. In language modeling, however, very little effort has gone into modeling them jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers for transforming continuous speech signals into discrete units, and we use different methods to construct mixed speech-text data. We introduce automatic metrics to evaluate how well the joint language model (LM) mixes speech and text. We also fine-tune the LM on downstream spoken language understanding (SLU) tasks with different modalities (speech or text) and test its performance to assess how well the model learns shared representations. Our results show that, by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks and shows zero-shot cross-modal transferability.
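To make the idea of "mixed speech-text data" concrete, the following is a minimal sketch and not the mixing technique proposed in the paper. It assumes that each word has already been aligned with the discrete speech units covering its audio span, and it picks a modality independently for each word. The function name `mix_speech_text`, the `<speech>`/`<text>` modality tags, the `<unit_N>` token format, and the unit IDs are all hypothetical and chosen only for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Hypothetical aligned input: each word is paired with the discrete unit IDs
# that some speech tokenizer assigned to its audio span (IDs are made up).
AlignedWord = Tuple[str, List[int]]


def mix_speech_text(
    words: Sequence[AlignedWord],
    p_speech: float,
    rng: random.Random,
) -> List[str]:
    """Build one mixed token sequence, choosing a modality per word.

    With probability p_speech a word is rendered as its speech units,
    otherwise as its text. A modality tag is emitted at every switch.
    """
    tokens: List[str] = []
    current: Optional[str] = None
    for text, units in words:
        modality = "speech" if rng.random() < p_speech else "text"
        if modality != current:
            # Mark the switch so a model can tell the modalities apart.
            tokens.append("<speech>" if modality == "speech" else "<text>")
            current = modality
        if modality == "speech":
            tokens.extend(f"<unit_{u}>" for u in units)
        else:
            tokens.append(text)
    return tokens


if __name__ == "__main__":
    example = [("hello", [5, 5, 12]), ("world", [7, 3])]
    print(mix_speech_text(example, p_speech=0.5, rng=random.Random(0)))
```

Setting `p_speech` to 0.0 or 1.0 yields a text-only or speech-only sequence, which gives a simple way to produce single-modality baselines from the same aligned data.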
