

SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

May 18, 2023
作者: Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
cs.AI

Abstract

Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, preventing inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. With discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. Additionally, we employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow multi-modal human instructions and highlight the potential of handling multiple modalities with one model. Demos are shown at https://0nutation.github.io/SpeechGPT.github.io/.
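The abstract's key idea of "discrete speech representations" means speech is quantized into a finite set of unit IDs so it can share one token vocabulary with text. A minimal sketch of that step, assuming a HuBERT-style k-means codebook (the centroids, feature shapes, and token names like `<sosp>`/`<unit_12>` here are illustrative, not SpeechGPT's exact implementation):

```python
import numpy as np

def quantize_to_units(features, centroids):
    """Map continuous speech features (T, D) to discrete unit IDs
    by nearest-centroid assignment (k-means-style quantization)."""
    # squared distance from every frame to every codebook centroid
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def units_to_tokens(units):
    """Render unit IDs as special tokens so speech and text can be
    interleaved in one LLM input sequence (token names are hypothetical)."""
    return ["<sosp>"] + [f"<unit_{u}>" for u in units] + ["<eosp>"]

rng = np.random.default_rng(0)
centroids = rng.normal(size=(8, 4))   # toy codebook: 8 discrete units, dim 4
features = rng.normal(size=(5, 4))    # 5 frames of stand-in speech features
tokens = units_to_tokens(quantize_to_units(features, centroids))
print(tokens)
```

Once speech is rendered as tokens like these, the same autoregressive model can consume and emit both modalities, which is what enables the cross-modal instruction tuning the abstract describes.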