OneLLM: One Framework to Align All Modalities with Language

Abstract

Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language within a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with the LLM. Then, we build a universal projection module (UPM) by combining multiple image projection modules with dynamic routing. Finally, we progressively align more modalities to the LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset of 2M items covering image, audio, video, point cloud, depth/normal map, IMU, and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering, and reasoning, where it delivers excellent performance. Code, data, model, and online demo are available at https://github.com/csuhan/OneLLM
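
The abstract describes the universal projection module (UPM) only as a combination of several image projection modules with dynamic routing. As a rough illustration of that idea, and not the authors' implementation, the sketch below assumes each projection module is a single linear map and that the router is a softmax over per-expert scores, so every token is projected by a weighted blend of experts. The class name `UniversalProjectionSketch`, the dimensions, and the random initialization are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_linear(rng, out_dim, in_dim):
    """Random weight matrix standing in for a trained projection."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


class UniversalProjectionSketch:
    """Hypothetical soft mixture of projection experts (assumption, not OneLLM's code)."""

    def __init__(self, num_experts, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Each expert stands in for one image projection module.
        self.experts = [make_linear(rng, out_dim, in_dim)
                        for _ in range(num_experts)]
        # The router scores every expert for a given input token.
        self.router = make_linear(rng, num_experts, in_dim)

    def routing_weights(self, token):
        """Per-expert weights that are positive and sum to 1."""
        return softmax(matvec(self.router, token))

    def project(self, token):
        """Blend the expert outputs using the routing weights."""
        weights = self.routing_weights(token)
        outputs = [matvec(expert, token) for expert in self.experts]
        out_dim = len(outputs[0])
        return [sum(w * out[i] for w, out in zip(weights, outputs))
                for i in range(out_dim)]


upm = UniversalProjectionSketch(num_experts=3, in_dim=4, out_dim=2)
token = [0.5, -1.0, 0.25, 2.0]  # a made-up encoder output token
weights = upm.routing_weights(token)
projected = upm.project(token)
```

Because the routing weights depend on the input token, tokens from different modalities can favor different experts while sharing one set of parameters, which is one plausible reading of "dynamic routing" here.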
