SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
May 18, 2023
Authors: Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
cs.AI
Abstract
Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, preventing inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. With discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. Additionally, we employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow multi-modal human instructions and highlight the potential of handling multiple modalities with one model. Demos are shown at https://0nutation.github.io/SpeechGPT.github.io/.