

SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

May 18, 2023
作者: Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
cs.AI

Abstract

Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, preventing inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. With discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. Additionally, we employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow multi-modal human instructions and highlight the potential of handling multiple modalities with one model. Demos are shown at https://0nutation.github.io/SpeechGPT.github.io/.
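The abstract's key idea of "discrete speech representations" means speech is quantized into a finite set of unit IDs so it can share one token vocabulary with text. A minimal sketch of that step, assuming a HuBERT-style k-means codebook (the centroids, feature shapes, and token names like `<sosp>`/`<unit_12>` here are illustrative, not SpeechGPT's exact implementation):

```python
import numpy as np

def quantize_to_units(features, centroids):
    """Map continuous speech features (T, D) to discrete unit IDs
    by nearest-centroid assignment (k-means-style quantization)."""
    # squared distance from every frame to every codebook centroid
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def units_to_tokens(units):
    """Render unit IDs as special tokens so speech and text can be
    interleaved in one LLM input sequence (token names are hypothetical)."""
    return ["<sosp>"] + [f"<unit_{u}>" for u in units] + ["<eosp>"]

rng = np.random.default_rng(0)
centroids = rng.normal(size=(8, 4))   # toy codebook: 8 discrete units, dim 4
features = rng.normal(size=(5, 4))    # 5 frames of stand-in speech features
tokens = units_to_tokens(quantize_to_units(features, centroids))
print(tokens)
```

Once speech is rendered as tokens like these, the same autoregressive model can consume and emit both modalities, which is what enables the cross-modal instruction tuning the abstract describes.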