SpiRit-LM: Interleaved Spoken and Written Language Model
February 8, 2024
Authors: Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux
cs.AI
Abstract
We introduce SPIRIT-LM, a foundation multimodal language model that freely
mixes text and speech. Our model is based on a pretrained text language model
that we extend to the speech modality by continuously training it on text and
speech units. Speech and text sequences are concatenated into a single stream of
tokens, and the model is trained with a word-level interleaving method on a small,
automatically curated speech-text parallel corpus. SPIRIT-LM comes in two
versions: a BASE version that uses speech semantic units and an EXPRESSIVE
version that models expressivity using pitch and style units in addition to the
semantic units. For both versions, the text is encoded with subword BPE tokens.
The resulting model displays both the semantic abilities of text models and the
expressive abilities of speech models. Additionally, we demonstrate that
SPIRIT-LM is able to learn new tasks in a few-shot fashion across modalities
(i.e., ASR, TTS, speech classification).
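
As a concrete illustration of the word-level interleaving described above, here is a minimal Python sketch that mixes word-aligned text (BPE pieces) and speech (unit tokens) at word boundaries. The `[TEXT]`/`[SPEECH]` markers, the `[HuNN]` unit naming, and the switching probability are illustrative assumptions rather than the paper's exact recipe.

```python
import random

# Illustrative modality markers; the exact special tokens used by
# SPIRIT-LM may differ.
TEXT_MARK = "[TEXT]"
SPEECH_MARK = "[SPEECH]"

def interleave_words(bpe_words, unit_words, p_switch=0.3, rng=None):
    """Build one mixed token stream, switching modality at word boundaries.

    bpe_words  -- list of lists: BPE pieces for each word of a sentence
    unit_words -- list of lists: speech-unit tokens aligned to bpe_words
    p_switch   -- assumed probability of flipping modality at each boundary
    """
    rng = rng or random.Random(0)
    modality = rng.choice(("text", "speech"))
    tokens, open_modality = [], None
    for text_toks, speech_toks in zip(bpe_words, unit_words):
        if rng.random() < p_switch:
            modality = "speech" if modality == "text" else "text"
        if modality != open_modality:
            # Open a new span with its modality marker.
            tokens.append(TEXT_MARK if modality == "text" else SPEECH_MARK)
            open_modality = modality
        tokens.extend(text_toks if modality == "text" else speech_toks)
    return tokens

# Example: three words, each paired with speech-unit tokens.
bpe = [["▁the"], ["▁quick"], ["▁fox"]]
units = [["[Hu12]", "[Hu7]"], ["[Hu33]"], ["[Hu5]", "[Hu91]"]]
print(interleave_words(bpe, units))
```

Under this reading of the abstract, the EXPRESSIVE version would additionally carry pitch and style tokens inside each speech span, and a few-shot prompt for a cross-modal task such as ASR could be composed by concatenating several such speech-text pairs before the query; both details are sketched from the abstract, not the released implementation.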