PandaGPT: One Model To Instruction-Follow Them All

May 25, 2023
Authors: Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, Deng Cai
cs.AI

Abstract

We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as generating detailed image descriptions, writing stories inspired by videos, and answering questions about audio clips. More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally. For example, PandaGPT can connect how objects look in an image or video with how they sound in audio. To do so, PandaGPT combines the multimodal encoders from ImageBind with the large language models from Vicuna. Notably, only aligned image-text pairs are required to train PandaGPT. Thanks to ImageBind's strong capability for embedding data from different modalities into the same space, PandaGPT displays emergent, i.e., zero-shot, cross-modal behaviors for data other than image and text (e.g., video, audio, depth, thermal, and IMU). We hope that PandaGPT serves as an initial step toward building AGI that can perceive and understand inputs in different modalities holistically, as we humans do. Our project page is at https://panda-gpt.github.io/.
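To make the architecture described in the abstract concrete, here is a minimal PyTorch sketch, not the authors' released implementation: a frozen ImageBind-style encoder and a frozen Vicuna-style LLM bridged by a single trainable linear projection. The class name `PandaGPTSketch`, the dimensions (1024 and 4096), and the Hugging Face-style `inputs_embeds` interface are illustrative assumptions, and the paper's full training setup may differ (e.g., in which LLM weights, if any, are also tuned).

```python
# Minimal sketch (not the authors' released code) of the PandaGPT recipe:
# a frozen multimodal encoder and a frozen LLM bridged by one trainable
# linear projection, trained only on aligned image-text pairs.
import torch
import torch.nn as nn


class PandaGPTSketch(nn.Module):
    def __init__(self, encoder: nn.Module, llm: nn.Module,
                 encoder_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.encoder = encoder  # ImageBind-style encoder (assumed frozen)
        self.llm = llm          # Vicuna-style decoder-only LLM (assumed frozen)
        # The only trainable component in this sketch:
        self.proj = nn.Linear(encoder_dim, llm_dim)
        for module in (self.encoder, self.llm):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, modality_inputs, text_embeds):
        # Encode inputs from any supported modality (image, video, audio,
        # depth, thermal, IMU) into the shared ImageBind embedding space.
        with torch.no_grad():
            feats = self.encoder(modality_inputs)         # (B, encoder_dim)
        # Project into the LLM's token-embedding space and prepend the
        # result to the embedded text prompt as a soft "multimodal token".
        prefix = self.proj(feats).unsqueeze(1)            # (B, 1, llm_dim)
        inputs = torch.cat([prefix, text_embeds], dim=1)  # (B, 1+T, llm_dim)
        # Assumes a Hugging Face-style forward that accepts inputs_embeds.
        return self.llm(inputs_embeds=inputs)
```

Because ImageBind maps every modality into one shared embedding space, a projection fitted only on image-text pairs can, in principle, be fed audio, video, or depth embeddings unchanged at inference time, which is the source of the zero-shot cross-modal behavior the abstract describes.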