Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
July 16, 2025
Authors: Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai
cs.AI
Abstract
This paper focuses on monolithic Multimodal Large Language Models (MLLMs),
which integrate visual encoding and language decoding into a single model.
Existing structures and pre-training strategies for monolithic MLLMs often
suffer from unstable optimization and catastrophic forgetting. To address these
challenges, our key idea is to embed a new visual parameter space into a
pre-trained LLM, enabling stable learning of visual knowledge from noisy data
via delta tuning. Based on this principle, we first introduce Mono-InternVL,
an advanced monolithic MLLM that incorporates a set of visual experts through
a multimodal mixture-of-experts architecture.
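As a rough illustration of this design (a minimal sketch under hypothetical
module names, not the released implementation), the feed-forward sub-layer can
route tokens by modality: text tokens reuse the frozen pre-trained FFN, while
visual tokens pass through a newly added, trainable visual expert.

```python
import torch
import torch.nn as nn

class ModalityRoutedFFN(nn.Module):
    """Minimal sketch of a multimodal MoE feed-forward layer: the pre-trained
    text expert stays frozen, and a structurally identical visual expert is
    added as the new visual parameter space (names are illustrative)."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.text_expert = nn.Sequential(    # pre-trained LLM FFN, kept frozen
            nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
        self.visual_expert = nn.Sequential(  # new, trainable visual expert
            nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
        for p in self.text_expert.parameters():
            p.requires_grad = False          # delta tuning: LLM weights untouched

    def forward(self, x: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_visual: (batch, seq) boolean mask
        out = torch.empty_like(x)
        out[~is_visual] = self.text_expert(x[~is_visual])
        out[is_visual] = self.visual_expert(x[is_visual])
        return out
```

Because gradients only reach `visual_expert`, noisy multimodal pre-training
data cannot perturb the LLM's pre-trained language knowledge, which is the
intuition behind the stability and anti-forgetting claims above.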
In addition, we design an innovative Endogenous Visual Pre-training (EViP)
strategy for Mono-InternVL to maximize its visual capabilities via
progressive learning.
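A staged, delta-tuned pre-training loop in this spirit could look like the
sketch below; the stage names and data mixtures are illustrative assumptions,
not the paper's exact recipe.

```python
import torch.nn as nn

# Hypothetical progressive schedule: data shifts from noisy web-scale pairs
# toward cleaner, more task-oriented supervision, while the LLM stays frozen.
STAGES = [
    ("concept_learning",   "noisy web image-text pairs"),
    ("semantic_learning",  "higher-quality captioned images"),
    ("alignment_learning", "task and instruction data"),
]

VISUAL_GROUPS = ("visual_expert", "patch_embed")  # illustrative parameter names

def freeze_all_but_visual(model: nn.Module) -> None:
    """Delta tuning: only the new visual parameter space receives gradients."""
    for name, param in model.named_parameters():
        param.requires_grad = any(g in name for g in VISUAL_GROUPS)
```

Each stage would call `freeze_all_but_visual(model)` before optimization, so
every stage trains the same visual parameters on progressively better data.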
Mono-InternVL achieves competitive performance against existing MLLMs, but at
a relatively high data cost. Therefore, we further present Mono-InternVL-1.5,
a cheaper
and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++
introduces additional visual attention experts to Mono-InternVL-1.5 and
re-organizes the pre-training process in an efficient manner.
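Our reading of "visual attention experts", sketched below under assumed
names: visual tokens receive their own trainable QKV and output projections,
while the attention operation itself still mixes the full multimodal sequence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionWithVisualExperts(nn.Module):
    """Sketch: per-modality QKV/output projections with shared attention."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv_text = nn.Linear(d_model, 3 * d_model)    # pre-trained, frozen
        self.qkv_visual = nn.Linear(d_model, 3 * d_model)  # new, trainable
        self.out_text = nn.Linear(d_model, d_model)
        self.out_visual = nn.Linear(d_model, d_model)
        for module in (self.qkv_text, self.out_text):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, x: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        B, S, D = x.shape
        qkv = torch.empty(B, S, 3 * D, dtype=x.dtype, device=x.device)
        qkv[~is_visual] = self.qkv_text(x[~is_visual])
        qkv[is_visual] = self.qkv_visual(x[is_visual])
        q, k, v = qkv.view(B, S, 3, self.n_heads, self.d_head).permute(2, 0, 3, 1, 4)
        y = F.scaled_dot_product_attention(q, k, v)  # attends across modalities
        y = y.transpose(1, 2).reshape(B, S, D)
        out = torch.empty_like(x)
        out[~is_visual] = self.out_text(y[~is_visual])
        out[is_visual] = self.out_visual(y[is_visual])
        return out
```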
During inference, Mono-InternVL-1.5 uses a fused CUDA kernel to speed up its
MoE operations.
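The released kernel is hand-written CUDA; as a plain-PyTorch picture of what
such a kernel fuses (an assumption based on standard MoE dispatch, not the
actual kernel), routed computation sorts tokens by expert, runs one contiguous
matmul per expert, and restores the original token order. Each of these steps
is a separate GPU launch in eager mode; a fused kernel collapses them into one.

```python
import torch

def routed_ffn_dispatch(x, is_visual, w_text, w_visual):
    """Eager two-expert MoE dispatch (illustrative). Every line below costs
    at least one kernel launch; a fused CUDA kernel does the same routed
    matmul in a single pass, which is where the inference speedup comes from."""
    flat = x.reshape(-1, x.shape[-1])                        # (tokens, d_model)
    mask = is_visual.reshape(-1)
    order = torch.argsort(mask.to(torch.int8), stable=True)  # text tokens first
    grouped = flat[order]                                    # gather
    n_text = int((~mask).sum())
    out = torch.cat([grouped[:n_text] @ w_text,              # expert 0 matmul
                     grouped[n_text:] @ w_visual])           # expert 1 matmul
    inverse = torch.empty_like(order)
    inverse[order] = torch.arange(order.numel(), device=order.device)
    return out[inverse].reshape(*x.shape[:-1], -1)           # scatter back
```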
With these designs, Mono-InternVL-1.5 significantly reduces training and
inference costs while still maintaining performance competitive with
Mono-InternVL. To evaluate
our approach, we conduct extensive experiments across 15 benchmarks. Results
demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out
of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared
to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves
similar multimodal performance while reducing first-token latency by up to 69%.
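Here, first-token latency means the wall-clock time from receiving the
multimodal prompt to emitting the first generated token, dominated by the
prefill pass over image and text tokens. A rough way to measure it (our own
harness, not the paper's evaluation protocol) is sketched below.

```python
import time
import torch

@torch.no_grad()
def time_to_first_token(generate_one_token, n_warmup=3, n_runs=10):
    """Average prefill latency. `generate_one_token` is any zero-argument
    callable that runs the model on a fixed prompt with max_new_tokens=1
    (an illustrative interface, not the repository's API)."""
    for _ in range(n_warmup):                 # warm up kernels and caches
        generate_one_token()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_one_token()
        if torch.cuda.is_available():
            torch.cuda.synchronize()          # measure GPU work, not launch time
    return (time.perf_counter() - start) / n_runs
```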
Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.