SpiRit-LM: Interleaved Spoken and Written Language Model
February 8, 2024
Authors: Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux
cs.AI
Abstract
We introduce SPIRIT-LM, a foundation multimodal language model that freely
mixes text and speech. Our model is based on a pretrained text language model
that we extend to the speech modality by continuously training it on text and
speech units. Speech and text sequences are concatenated as a single set of
tokens, and trained with a word-level interleaving method using a small
automatically-curated speech-text parallel corpus. SPIRIT-LM comes in two
versions: a BASE version that uses speech semantic units and an EXPRESSIVE
version that models expressivity using pitch and style units in addition to the
semantic units. For both versions, the text is encoded with subword BPE tokens.
The resulting model displays both the semantic abilities of text models and the
expressive abilities of speech models. Additionally, we demonstrate that
SPIRIT-LM is able to learn new tasks in a few-shot fashion across modalities
(i.e. ASR, TTS, Speech Classification).
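
To make the word-level interleaving idea concrete, below is a minimal Python sketch of how aligned speech-text data could be rendered as a single token stream that switches modality at word boundaries. This is an illustrative assumption, not the paper's released code: the AlignedWord and interleave helpers, the [TEXT]/[SPEECH] markers, the [Hu...] unit rendering, and the switch probability are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
import random

@dataclass
class AlignedWord:
    """One word with its text form and its speech-unit ids.

    The word-level speech-text alignment is assumed to be given
    (e.g. produced by a forced aligner over a parallel corpus).
    """
    text: str
    speech_units: list[int]  # e.g. deduplicated semantic-unit cluster ids


# Hypothetical special tokens marking a modality switch.
TEXT_TOKEN = "[TEXT]"
SPEECH_TOKEN = "[SPEECH]"


def interleave(words: list[AlignedWord], p_switch: float = 0.3,
               rng: random.Random | None = None) -> list[str]:
    """Render an aligned sentence as one token sequence, switching
    modality at word boundaries with probability p_switch."""
    rng = rng or random.Random(0)
    tokens: list[str] = []
    modality = rng.choice(["text", "speech"])
    tokens.append(TEXT_TOKEN if modality == "text" else SPEECH_TOKEN)
    for word in words:
        if rng.random() < p_switch:  # flip modality at this word boundary
            modality = "speech" if modality == "text" else "text"
            tokens.append(TEXT_TOKEN if modality == "text" else SPEECH_TOKEN)
        if modality == "text":
            tokens.append(word.text)  # would be further split into BPE pieces
        else:
            tokens.extend(f"[Hu{u}]" for u in word.speech_units)
    return tokens


# Toy usage with made-up unit ids:
sentence = [AlignedWord("the", [4, 81]),
            AlignedWord("cat", [12, 7, 55]),
            AlignedWord("sat", [33, 9])]
print(interleave(sentence))
```

Switching only at word boundaries keeps both streams anchored to the same lexical content, which is what lets the model associate speech units and subword BPE tokens with shared semantics during continued training.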