PandaGPT: One Model To Instruction-Follow Them All

Abstract

We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as generating detailed image descriptions, writing stories inspired by videos, and answering questions about audio. More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally. For example, PandaGPT can connect how objects look in an image or video with how they sound in an audio clip. To do so, PandaGPT combines the multimodal encoders from ImageBind with the large language models from Vicuna. Notably, training PandaGPT requires only aligned image-text pairs. Thanks to ImageBind's strong capability to embed data from different modalities into the same space, PandaGPT displays emergent, i.e., zero-shot, cross-modal behaviors for data other than image and text (e.g., video, audio, depth, thermal, and IMU). We hope that PandaGPT serves as an initial step toward building AGI that can perceive and understand inputs in different modalities holistically, as we humans do. Our project page is at https://panda-gpt.github.io/.

