SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
May 18, 2023
Authors: Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
cs.AI
Abstract
Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, preventing inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. With discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. Additionally, we employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow multi-modal human instructions and highlight the potential of handling multiple modalities with one model. Demos are shown at https://0nutation.github.io/SpeechGPT.github.io/.