Textually Pretrained Speech Language Models

Abstract

Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm start from a pretrained textual language model. Using both automatic and human evaluations, we show that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices, such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model scale and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest SpeechLM to date (to the best of our knowledge), in terms of both number of parameters and amount of training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. Speech samples can be found on our website: https://pages.cs.huji.ac.il/adiyoss-lab/twist/
