NExT-GPT: Any-to-Any Multimodal LLM

Abstract

Although Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, most are limited to input-side multimodal understanding and cannot produce content in multiple modalities. Because humans perceive the world and communicate with one another through various modalities, developing any-to-any MM-LLMs that can accept and deliver content in any modality is essential to human-level AI. To fill this gap, we present NExT-GPT, an end-to-end, general-purpose any-to-any MM-LLM system. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging existing well-trained, high-performing encoders and decoders, NExT-GPT is tuned with only a small fraction of parameters (1%), located in certain projection layers, which not only keeps training cost low but also facilitates convenient expansion to additional modalities. Moreover, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for it, through which NExT-GPT acquires complex cross-modal semantic understanding and content generation capabilities. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community.
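
The parameter-efficiency claim in the abstract (frozen pretrained encoders and decoders, with only projection layers tuned) can be illustrated with a small bookkeeping sketch. This is not the authors' code: the component names and every parameter count below are hypothetical placeholders chosen so the arithmetic lands on 1%, and treating the LLM itself as fully frozen is an assumption of this sketch, not something the abstract states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One block of the pipeline with a (hypothetical) parameter count."""
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(components):
    """Return the share of parameters that would receive gradient updates."""
    total = sum(c.num_params for c in components)
    if total == 0:
        raise ValueError("pipeline has no parameters")
    tuned = sum(c.num_params for c in components if c.trainable)
    return tuned / total


# Placeholder sizes chosen only to make the arithmetic easy to follow;
# they are NOT the parameter counts reported for NExT-GPT.
pipeline = [
    Component("multimodal_encoder", 1_000_000_000, trainable=False),
    Component("input_projection", 40_000_000, trainable=True),
    Component("llm", 7_000_000_000, trainable=False),  # assumed frozen here
    Component("output_projection", 60_000_000, trainable=True),
    Component("diffusion_decoders", 1_900_000_000, trainable=False),
]

fraction = trainable_fraction(pipeline)
print(f"trainable share: {fraction:.2%}")
```

Under this layout, supporting a further modality would mean adding another frozen encoder or decoder plus its own small projection entry, which is one way to read the abstract's point about convenient expansion.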
