OneLLM: One Framework to Align All Modalities with Language
December 6, 2023
作者: Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue
cs.AI
Abstract
Multimodal large language models (MLLMs) have gained significant attention
due to their strong multimodal understanding capability. However, existing
works rely heavily on modality-specific encoders, which usually differ in
architecture and are limited to common modalities. In this paper, we present
OneLLM, an MLLM that aligns eight modalities to language using a unified
framework. We achieve this through a unified multimodal encoder and a
progressive multimodal alignment pipeline. Specifically, we first train an image
projection module to connect a vision encoder with the LLM. Then, we build a
universal projection module (UPM) by mixing multiple image projection modules
with dynamic routing. Finally, we progressively align more modalities to the
LLM with the UPM. To fully leverage the potential of OneLLM in following
instructions, we also curate a comprehensive multimodal instruction dataset
comprising 2M items spanning image, audio, video, point cloud, depth/normal
map, IMU, and fMRI brain activity. OneLLM is evaluated on 25 diverse
benchmarks, encompassing tasks such as multimodal captioning, question
answering, and reasoning, where it delivers excellent performance. Code, data,
models, and an online demo are available at https://github.com/csuhan/OneLLM.
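To make the alignment pipeline concrete, below is a minimal sketch of what a universal projection module (UPM) with dynamic routing could look like: a small set of MLP projection experts combined by per-sample soft routing weights. The module names, dimensions, and routing scheme here are illustrative assumptions, not the released OneLLM code.

```python
# A minimal sketch of a universal projection module (UPM) with dynamic routing,
# following the description in the abstract. Module names, hidden sizes, and
# the soft-routing scheme are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class UniversalProjection(nn.Module):
    """Mixes K projection experts with input-dependent soft routing weights."""

    def __init__(self, enc_dim: int = 1024, llm_dim: int = 4096, num_experts: int = 3):
        super().__init__()
        # Each expert maps tokens from the unified multimodal encoder
        # into the LLM embedding space (assumed to be simple MLPs here).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(enc_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
            for _ in range(num_experts)
        ])
        # Router predicts per-sample mixing weights over the experts.
        self.router = nn.Linear(enc_dim, num_experts)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, enc_dim) from the unified multimodal encoder
        weights = torch.softmax(self.router(tokens.mean(dim=1)), dim=-1)   # (batch, K)
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=1)  # (batch, K, seq, llm_dim)
        # Weighted sum over experts yields tokens ready to feed into the LLM.
        return (weights[:, :, None, None] * expert_out).sum(dim=1)          # (batch, seq, llm_dim)


if __name__ == "__main__":
    upm = UniversalProjection()
    feats = torch.randn(2, 256, 1024)   # e.g. patch tokens from a CLIP-like vision encoder
    llm_tokens = upm(feats)
    print(llm_tokens.shape)             # torch.Size([2, 256, 4096])
```

In the pipeline the abstract describes, such a module would be initialized by mixing image projection modules trained in the first stage, and additional modalities would then be progressively aligned to the LLM through the same UPM.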