OneLLM: One Framework to Align All Modalities with Language
December 6, 2023
作者: Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue
cs.AI
Abstract
Multimodal large language models (MLLMs) have gained significant attention
due to their strong multimodal understanding capability. However, existing
works rely heavily on modality-specific encoders, which usually differ in
architecture and are limited to common modalities. In this paper, we present
OneLLM, an MLLM that aligns eight modalities to language using a unified
framework. We achieve this through a unified multimodal encoder and a
progressive multimodal alignment pipeline. Specifically, we first train an image
projection module to connect a vision encoder with the LLM. Then, we build a
universal projection module (UPM) by mixing multiple image projection modules
with dynamic routing. Finally, we progressively align more modalities to the
LLM with the UPM. To fully leverage the potential of OneLLM in following
instructions, we also curate a comprehensive multimodal instruction dataset
comprising 2M items spanning image, audio, video, point cloud, depth/normal
map, IMU, and fMRI brain activity. OneLLM is evaluated on 25 diverse
benchmarks, encompassing tasks such as multimodal captioning, question
answering, and reasoning, where it delivers excellent performance. Code, data,
models, and an online demo are available at https://github.com/csuhan/OneLLM.
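To make the alignment pipeline concrete, below is a minimal sketch of what a universal projection module (UPM) with dynamic routing could look like: a small set of MLP projection experts combined by per-sample soft routing weights. The module names, dimensions, and routing scheme here are illustrative assumptions, not the released OneLLM code.

```python
# A minimal sketch of a universal projection module (UPM) with dynamic routing,
# following the description in the abstract. Module names, hidden sizes, and
# the soft-routing scheme are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class UniversalProjection(nn.Module):
    """Mixes K projection experts with input-dependent soft routing weights."""

    def __init__(self, enc_dim: int = 1024, llm_dim: int = 4096, num_experts: int = 3):
        super().__init__()
        # Each expert maps tokens from the unified multimodal encoder
        # into the LLM embedding space (assumed to be simple MLPs here).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(enc_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
            for _ in range(num_experts)
        ])
        # Router predicts per-sample mixing weights over the experts.
        self.router = nn.Linear(enc_dim, num_experts)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, enc_dim) from the unified multimodal encoder
        weights = torch.softmax(self.router(tokens.mean(dim=1)), dim=-1)   # (batch, K)
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=1)  # (batch, K, seq, llm_dim)
        # Weighted sum over experts yields tokens ready to feed into the LLM.
        return (weights[:, :, None, None] * expert_out).sum(dim=1)          # (batch, seq, llm_dim)


if __name__ == "__main__":
    upm = UniversalProjection()
    feats = torch.randn(2, 256, 1024)   # e.g. patch tokens from a CLIP-like vision encoder
    llm_tokens = upm(feats)
    print(llm_tokens.shape)             # torch.Size([2, 256, 4096])
```

In the pipeline the abstract describes, such a module would be initialized by mixing image projection modules trained in the first stage, and additional modalities would then be progressively aligned to the LLM through the same UPM.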