Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

Abstract

This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing architectures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts (MoE) architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) strategy for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but incurs a relatively high data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and reorganizes the pre-training process more efficiently. For inference, Mono-InternVL-1.5 includes a fused CUDA kernel that speeds up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs while maintaining performance competitive with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., a +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.
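The idea of embedding a separate visual parameter space into a frozen, pre-trained LLM can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the released code: the `Expert` class is a hypothetical stand-in for a feed-forward expert, the routing is assumed to be a fixed per-token modality switch rather than a learned gate, and "delta tuning" is modeled simply as marking only the visual expert as trainable.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]


@dataclass
class Expert:
    """Toy expert: an elementwise affine map standing in for a real FFN (hypothetical)."""
    scale: float
    bias: float
    trainable: bool

    def __call__(self, x: Vector) -> Vector:
        return [self.scale * v + self.bias for v in x]


def route_tokens(
    tokens: List[Vector],
    is_visual: List[bool],
    text_expert: Expert,
    visual_expert: Expert,
) -> List[Vector]:
    """Send each token to the expert of its modality (assumed static routing)."""
    return [
        (visual_expert if vis else text_expert)(tok)
        for tok, vis in zip(tokens, is_visual)
    ]


def trainable_names(experts: Dict[str, Expert]) -> List[str]:
    """List the experts that would receive gradient updates under delta tuning."""
    return [name for name, expert in experts.items() if expert.trainable]


# The pre-trained text expert stays frozen; only the new visual expert is trainable.
text_expert = Expert(scale=1.0, bias=0.0, trainable=False)
visual_expert = Expert(scale=2.0, bias=1.0, trainable=True)

outputs = route_tokens(
    tokens=[[1.0, 2.0], [3.0, 4.0]],
    is_visual=[False, True],
    text_expert=text_expert,
    visual_expert=visual_expert,
)
```

In this sketch the text token passes through the frozen expert unchanged, while the image token is transformed by the visual expert. The point of the design is that noisy visual data can only modify the newly added parameters, which is one way to read the abstract's claim about avoiding catastrophic forgetting.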
