Textually Pretrained Speech Language Models
May 22, 2023
Authors: Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, Yossi Adi
cs.AI
Abstract
Speech language models (SpeechLMs) process and generate acoustic data only,
without textual supervision. In this work, we propose TWIST, a method for
training SpeechLMs using a warm start from a pretrained textual language
model. We show using both automatic and human evaluations that TWIST
outperforms a cold-start SpeechLM across the board. We empirically analyze the
effect of different model design choices such as the speech tokenizer, the
pretrained textual model, and the dataset size. We find that model and dataset
scale both play an important role in constructing better-performing SpeechLMs.
Based on our observations, we present the largest (to the best of our
knowledge) SpeechLM both in terms of number of parameters and training data. We
additionally introduce two spoken versions of the StoryCloze textual benchmark
to further improve model evaluation and advance future research in the field.
Speech samples can be found on our website:
https://pages.cs.huji.ac.il/adiyoss-lab/twist/