PandaGPT: One Model To Instruction-Follow Them All

May 25, 2023
Authors: Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, Deng Cai
cs.AI

Abstract

We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as detailed image description generation, writing stories inspired by videos, and answering questions about audio. More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally. For example, PandaGPT can connect how objects look in an image/video with how they sound in audio. To do so, PandaGPT combines the multimodal encoders from ImageBind and the large language models from Vicuna. Notably, only aligned image-text pairs are required for the training of PandaGPT. Thanks to the strong capability of ImageBind in embedding data from different modalities into the same space, PandaGPT displays emergent, i.e., zero-shot, cross-modal behaviors for data other than image and text (e.g., video, audio, depth, thermal, and IMU). We hope that PandaGPT serves as an initial step toward building AGI that can perceive and understand inputs in different modalities holistically, as we humans do. Our project page is at https://panda-gpt.github.io/.
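
To make the architecture in the abstract concrete, here is a minimal sketch (in PyTorch, not the authors' released code) of the glue it describes: a frozen multimodal encoder standing in for ImageBind produces a fixed-size embedding, and a learned linear projection maps that embedding into the LLM's token-embedding space so it can be prepended to the text sequence fed to a Vicuna-style model. The class name `MultimodalPrefixLM` and the 1024/4096 dimensions are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn


class MultimodalPrefixLM(nn.Module):
    """Hypothetical PandaGPT-style glue layer (illustrative only).

    A frozen multimodal encoder (e.g., ImageBind) maps any supported
    modality into one shared embedding space; a learned linear projection
    turns that embedding into a single "soft token" prepended to the
    LLM's text embeddings.
    """

    def __init__(self, encoder_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Assumed to be the only newly trained weights:
        # encoder space -> LLM embedding space.
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, modality_emb: torch.Tensor,
                text_embs: torch.Tensor) -> torch.Tensor:
        # modality_emb: (batch, encoder_dim), from the frozen encoder.
        # text_embs:    (batch, seq_len, llm_dim), from the LLM's
        #               token-embedding table.
        prefix = self.proj(modality_emb).unsqueeze(1)  # (batch, 1, llm_dim)
        # The combined sequence would then be fed to the language model.
        return torch.cat([prefix, text_embs], dim=1)


if __name__ == "__main__":
    glue = MultimodalPrefixLM()
    image_emb = torch.randn(2, 1024)      # stand-in for an ImageBind embedding
    text_embs = torch.randn(2, 16, 4096)  # stand-in for Vicuna token embeddings
    print(glue(image_emb, text_embs).shape)  # torch.Size([2, 17, 4096])
```

Under this reading, the zero-shot cross-modal behavior the abstract reports follows naturally: because ImageBind embeds video, audio, depth, thermal, and IMU data into the same space as images, a projection trained only on image-text pairs can be reused unchanged for the other modalities.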