SpiRit-LM: Interleaved Spoken and Written Language Model
February 8, 2024
Authors: Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux
cs.AI
Abstract
We introduce SPIRIT-LM, a foundation multimodal language model that freely
mixes text and speech. Our model is based on a pretrained text language model
that we extend to the speech modality by continuously training it on text and
speech units. Speech and text sequences are concatenated as a single set of
tokens, and trained with a word-level interleaving method using a small
automatically-curated speech-text parallel corpus. SPIRIT-LM comes in two
versions: a BASE version that uses speech semantic units and an EXPRESSIVE
version that models expressivity using pitch and style units in addition to the
semantic units. For both versions, the text is encoded with subword BPE tokens.
The resulting model displays both the semantic abilities of text models and the
expressive abilities of speech models. Additionally, we demonstrate that
SPIRIT-LM is able to learn new tasks in a few-shot fashion across modalities
(i.e. ASR, TTS, Speech Classification).
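
To make the word-level interleaving idea concrete, below is a minimal Python sketch of how aligned speech-text data could be rendered as a single token stream that switches modality at word boundaries. This is an illustrative assumption, not the paper's released code: the AlignedWord and interleave helpers, the [TEXT]/[SPEECH] markers, the [Hu...] unit rendering, and the switch probability are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
import random

@dataclass
class AlignedWord:
    """One word with its text form and its speech-unit ids.

    The word-level speech-text alignment is assumed to be given
    (e.g. produced by a forced aligner over a parallel corpus).
    """
    text: str
    speech_units: list[int]  # e.g. deduplicated semantic-unit cluster ids


# Hypothetical special tokens marking a modality switch.
TEXT_TOKEN = "[TEXT]"
SPEECH_TOKEN = "[SPEECH]"


def interleave(words: list[AlignedWord], p_switch: float = 0.3,
               rng: random.Random | None = None) -> list[str]:
    """Render an aligned sentence as one token sequence, switching
    modality at word boundaries with probability p_switch."""
    rng = rng or random.Random(0)
    tokens: list[str] = []
    modality = rng.choice(["text", "speech"])
    tokens.append(TEXT_TOKEN if modality == "text" else SPEECH_TOKEN)
    for word in words:
        if rng.random() < p_switch:  # flip modality at this word boundary
            modality = "speech" if modality == "text" else "text"
            tokens.append(TEXT_TOKEN if modality == "text" else SPEECH_TOKEN)
        if modality == "text":
            tokens.append(word.text)  # would be further split into BPE pieces
        else:
            tokens.extend(f"[Hu{u}]" for u in word.speech_units)
    return tokens


# Toy usage with made-up unit ids:
sentence = [AlignedWord("the", [4, 81]),
            AlignedWord("cat", [12, 7, 55]),
            AlignedWord("sat", [33, 9])]
print(interleave(sentence))
```

Switching only at word boundaries keeps both streams anchored to the same lexical content, which is what lets the model associate speech units and subword BPE tokens with shared semantics during continued training.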