
Textually Pretrained Speech Language Models

May 22, 2023
Authors: Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, Yossi Adi
cs.AI

Abstract

Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm start from a pretrained textual language model. We show, using both automatic and human evaluations, that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. Speech samples can be found on our website: https://pages.cs.huji.ac.il/adiyoss-lab/twist/
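The warm-start recipe the abstract describes can be sketched in a few lines: initialize from a pretrained textual causal LM, swap its text vocabulary for a vocabulary of discrete speech units, and continue next-token training on speech-token sequences. The snippet below is a minimal, hypothetical illustration using HuggingFace transformers; the checkpoint name, the speech-unit vocabulary size, and the dummy batch are assumptions for demonstration, not the paper's exact configuration.

```python
# Minimal sketch of a TWIST-style warm start (illustrative, not the
# paper's exact setup): start from a pretrained text LM and retrain
# it on discrete speech tokens with the same causal-LM objective.
import torch
from transformers import AutoModelForCausalLM

# Assumption: size of the discrete speech-unit vocabulary (e.g., the
# number of clusters used by a k-means speech tokenizer).
NUM_SPEECH_TOKENS = 500

# 1) Warm start: load a pretrained textual language model
#    (checkpoint name is an illustrative choice).
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# 2) Replace the text vocabulary with speech units: resize the
#    input/output embeddings to the speech-token vocabulary.
model.resize_token_embeddings(NUM_SPEECH_TOKENS)

# 3) Continue training with the usual next-token objective, now over
#    sequences of discrete speech tokens produced by a speech tokenizer.
speech_tokens = torch.randint(0, NUM_SPEECH_TOKENS, (2, 128))  # dummy batch
out = model(input_ids=speech_tokens, labels=speech_tokens)
out.loss.backward()  # one training step would follow with an optimizer
```

In this view, a "cold-start" baseline differs only in step 1: the same architecture is trained from randomly initialized weights rather than from the pretrained text checkpoint.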