PandaGPT: One Model To Instruction-Follow Them All

Abstract

We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as generating detailed image descriptions, writing stories inspired by videos, and answering questions about audio. More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally. For example, PandaGPT can connect how objects look in an image or video with how they sound in audio. To do so, PandaGPT combines the multimodal encoders from ImageBind with the large language models from Vicuna. Notably, training PandaGPT requires only aligned image-text pairs. Because ImageBind embeds data from different modalities into the same space, PandaGPT displays emergent, i.e., zero-shot, cross-modal behaviors for data other than image and text (e.g., video, audio, depth, thermal, and IMU). We hope that PandaGPT serves as an initial step toward building AGI that can perceive and understand inputs in different modalities holistically, as humans do. Our project page is at https://panda-gpt.github.io/.
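
The zero-shot claim can be made concrete with a toy sketch. It is not the PandaGPT implementation: the abstract does not say how the encoder outputs are connected to the language model, so the linear projection below is an assumption, and every vector, name, and function here (`SHARED`, `LLM_TARGET`, `fit_projection`, `nearest_concept`) is hypothetical. The sketch only illustrates the general idea that a mapping fitted on image data alone can carry over to audio when both modalities share one embedding space.

```python
import math

# Hypothetical stand-ins for a frozen multimodal encoder whose outputs live in
# one shared space (the role ImageBind plays in the abstract). The vectors are
# made up; matching concepts are simply placed close together across modalities.
SHARED = {
    ("image", "dog"): [1.0, 0.0, 0.2, 0.0],
    ("image", "car"): [0.0, 1.0, 0.0, 0.2],
    ("audio", "dog"): [0.9, 0.1, 0.3, 0.0],  # never used when fitting
    ("audio", "car"): [0.1, 0.9, 0.0, 0.3],  # never used when fitting
}

# Hypothetical target vectors in the language model's input space.
LLM_TARGET = {"dog": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}


def project(weights, vec):
    """Apply a linear map (one row per output dimension) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fit_projection(pairs, in_dim=4, out_dim=3, lr=0.1, steps=500):
    """Fit the linear map by per-sample gradient descent on squared error."""
    weights = [[0.0] * in_dim for _ in range(out_dim)]
    for _ in range(steps):
        for vec, target in pairs:
            pred = project(weights, vec)
            for i in range(out_dim):
                err = pred[i] - target[i]
                for j in range(in_dim):
                    weights[i][j] -= lr * err * vec[j]
    return weights


# Fit on image embeddings only (assumed analogue of image-text-only training).
image_pairs = [(SHARED[("image", c)], LLM_TARGET[c]) for c in ("dog", "car")]
W = fit_projection(image_pairs)


def nearest_concept(vec):
    """Return the concept whose target is closest to the projected vector."""
    projected = project(W, vec)
    return max(LLM_TARGET, key=lambda c: cosine(projected, LLM_TARGET[c]))


# Audio embeddings were never seen during fitting, yet they land near the
# right targets because they sit close to the image embeddings.
for concept in ("dog", "car"):
    print(concept, "->", nearest_concept(SHARED[("audio", concept)]))
```

In this toy setup the transfer works only because the made-up audio vectors were placed near the image vectors for the same concept. In the setting the abstract describes, that alignment is what ImageBind is credited with providing.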
