SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

Abstract

Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, which prevents inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. Using discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. We then employ a three-stage training strategy consisting of modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow multi-modal human instructions, and they highlight the potential of handling multiple modalities with one model. Demos are available at https://0nutation.github.io/SpeechGPT.github.io/.
