

OneLLM: One Framework to Align All Modalities with Language

December 6, 2023
作者: Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue
cs.AI

Abstract

Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with the LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to the LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM
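The core architectural idea in the abstract is the universal projection module (UPM): several projection "experts", grown out of image projection modules, whose outputs are mixed by a dynamic router before being passed to the LLM. Below is a minimal sketch of that mixture-plus-routing idea in PyTorch; the class name, dimensions, and per-token soft routing are illustrative assumptions rather than the paper's exact implementation (see the repository above for the real code).

```python
# Minimal sketch (not the authors' code) of a UPM-style mixture of projection
# experts with dynamic routing. All names and dimensions below are assumptions.
import torch
import torch.nn as nn


class UniversalProjectionModule(nn.Module):
    def __init__(self, enc_dim: int = 1024, llm_dim: int = 4096, num_experts: int = 3):
        super().__init__()
        # Each expert plays the role of one image projection module.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(enc_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
            for _ in range(num_experts)
        ])
        # Dynamic router: predicts per-token mixing weights over the experts.
        self.router = nn.Linear(enc_dim, num_experts)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, enc_dim) from the unified multimodal encoder.
        weights = self.router(tokens).softmax(dim=-1)                         # (B, L, E)
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=-2)   # (B, L, E, llm_dim)
        # Weighted sum of expert outputs yields tokens in the LLM's embedding space.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=-2)               # (B, L, llm_dim)


if __name__ == "__main__":
    upm = UniversalProjectionModule()
    feats = torch.randn(2, 32, 1024)   # encoder tokens for some modality
    print(upm(feats).shape)            # torch.Size([2, 32, 4096])
```

In this sketch the router produces soft weights, so every expert contributes to every token; a sparse (top-k) routing scheme would be a drop-in alternative if only a subset of experts should fire per input.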