OneLLM: One Framework to Align All Modalities with Language

Abstract

Multimodal large language models (MLLMs) have gained significant attention for their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with the LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to the LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset of 2M items covering image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM.
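The abstract states only that the UPM mixes multiple image projection modules with dynamic routing. As a rough illustration of that idea, the sketch below assumes soft routing: a small router scores each projection expert, a softmax turns the scores into weights, and the output is the weighted sum of the expert projections. The class name `UniversalProjectionSketch`, the helper functions, the use of plain linear maps as experts, and all dimensions are hypothetical choices made for this example, not details taken from the paper or its code release.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_matrix(rows, cols, rng):
    """Small random weights; stands in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class UniversalProjectionSketch:
    """Hypothetical soft mixture of projection experts (assumption, not the paper's code)."""

    def __init__(self, in_dim, out_dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Each expert is a linear map here; this is a simplification for illustration.
        self.experts = [make_matrix(out_dim, in_dim, rng) for _ in range(num_experts)]
        # The router produces one score per expert from the input token.
        self.router = make_matrix(num_experts, in_dim, rng)

    def routing_weights(self, token):
        """Softmax-normalized weight for each expert, given one input token."""
        return softmax(matvec(self.router, token))

    def project(self, token):
        """Weighted sum of all expert outputs for one input token."""
        weights = self.routing_weights(token)
        outputs = [matvec(expert, token) for expert in self.experts]
        out_dim = len(outputs[0])
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(out_dim)
        ]


upm = UniversalProjectionSketch(in_dim=4, out_dim=3, num_experts=3)
token = [0.5, -1.0, 0.25, 2.0]
weights = upm.routing_weights(token)
projected = upm.project(token)
```

Because the routing weights depend on the input token, tokens from different modalities can draw on the experts in different proportions while sharing the same set of parameters, which is one plausible reading of "dynamic routing" in the abstract.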
