Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
July 16, 2025
Authors: Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai
cs.AI
Abstract
This paper focuses on monolithic Multimodal Large Language Models (MLLMs),
which integrate visual encoding and language decoding into a single model.
Existing structures and pre-training strategies for monolithic MLLMs often
suffer from unstable optimization and catastrophic forgetting. To address these
challenges, our key idea is to embed a new visual parameter space into a
pre-trained LLM, enabling stable learning of visual knowledge from noisy data
via delta tuning. Based on this principle, we first introduce Mono-InternVL,
an advanced monolithic MLLM that incorporates a set of visual experts through
a multimodal mixture-of-experts architecture.
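As a rough illustration of this design (a minimal sketch under hypothetical
module names, not the released implementation), the feed-forward sub-layer can
route tokens by modality: text tokens reuse the frozen pre-trained FFN, while
visual tokens pass through a newly added, trainable visual expert.

```python
import torch
import torch.nn as nn

class ModalityRoutedFFN(nn.Module):
    """Minimal sketch of a multimodal MoE feed-forward layer: the pre-trained
    text expert stays frozen, and a structurally identical visual expert is
    added as the new visual parameter space (names are illustrative)."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.text_expert = nn.Sequential(    # pre-trained LLM FFN, kept frozen
            nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
        self.visual_expert = nn.Sequential(  # new, trainable visual expert
            nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
        for p in self.text_expert.parameters():
            p.requires_grad = False          # delta tuning: LLM weights untouched

    def forward(self, x: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_visual: (batch, seq) boolean mask
        out = torch.empty_like(x)
        out[~is_visual] = self.text_expert(x[~is_visual])
        out[is_visual] = self.visual_expert(x[is_visual])
        return out
```

Because gradients only reach `visual_expert`, noisy multimodal pre-training
data cannot perturb the LLM's pre-trained language knowledge, which is the
intuition behind the stability and anti-forgetting claims above.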
In addition, we design an innovative Endogenous Visual Pre-training (EViP)
strategy for Mono-InternVL to maximize its visual capabilities via
progressive learning.
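A staged, delta-tuned pre-training loop in this spirit could look like the
sketch below; the stage names and data mixtures are illustrative assumptions,
not the paper's exact recipe.

```python
import torch.nn as nn

# Hypothetical progressive schedule: data shifts from noisy web-scale pairs
# toward cleaner, more task-oriented supervision, while the LLM stays frozen.
STAGES = [
    ("concept_learning",   "noisy web image-text pairs"),
    ("semantic_learning",  "higher-quality captioned images"),
    ("alignment_learning", "task and instruction data"),
]

VISUAL_GROUPS = ("visual_expert", "patch_embed")  # illustrative parameter names

def freeze_all_but_visual(model: nn.Module) -> None:
    """Delta tuning: only the new visual parameter space receives gradients."""
    for name, param in model.named_parameters():
        param.requires_grad = any(g in name for g in VISUAL_GROUPS)
```

Each stage would call `freeze_all_but_visual(model)` before optimization, so
every stage trains the same visual parameters on progressively better data.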
Mono-InternVL achieves competitive performance against existing MLLMs, but at
a relatively high data cost. Therefore, we further present Mono-InternVL-1.5,
a cheaper
and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++
introduces additional visual attention experts to Mono-InternVL-1.5 and
re-organizes the pre-training process in an efficient manner.
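Our reading of "visual attention experts", sketched below under assumed
names: visual tokens receive their own trainable QKV and output projections,
while the attention operation itself still mixes the full multimodal sequence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionWithVisualExperts(nn.Module):
    """Sketch: per-modality QKV/output projections with shared attention."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv_text = nn.Linear(d_model, 3 * d_model)    # pre-trained, frozen
        self.qkv_visual = nn.Linear(d_model, 3 * d_model)  # new, trainable
        self.out_text = nn.Linear(d_model, d_model)
        self.out_visual = nn.Linear(d_model, d_model)
        for module in (self.qkv_text, self.out_text):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, x: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        B, S, D = x.shape
        qkv = torch.empty(B, S, 3 * D, dtype=x.dtype, device=x.device)
        qkv[~is_visual] = self.qkv_text(x[~is_visual])
        qkv[is_visual] = self.qkv_visual(x[is_visual])
        q, k, v = qkv.view(B, S, 3, self.n_heads, self.d_head).permute(2, 0, 3, 1, 4)
        y = F.scaled_dot_product_attention(q, k, v)  # attends across modalities
        y = y.transpose(1, 2).reshape(B, S, D)
        out = torch.empty_like(x)
        out[~is_visual] = self.out_text(y[~is_visual])
        out[is_visual] = self.out_visual(y[is_visual])
        return out
```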
During inference, Mono-InternVL-1.5 uses a fused CUDA kernel to speed up its
MoE operations.
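The released kernel is hand-written CUDA; as a plain-PyTorch picture of what
such a kernel fuses (an assumption based on standard MoE dispatch, not the
actual kernel), routed computation sorts tokens by expert, runs one contiguous
matmul per expert, and restores the original token order. Each of these steps
is a separate GPU launch in eager mode; a fused kernel collapses them into one.

```python
import torch

def routed_ffn_dispatch(x, is_visual, w_text, w_visual):
    """Eager two-expert MoE dispatch (illustrative). Every line below costs
    at least one kernel launch; a fused CUDA kernel does the same routed
    matmul in a single pass, which is where the inference speedup comes from."""
    flat = x.reshape(-1, x.shape[-1])                        # (tokens, d_model)
    mask = is_visual.reshape(-1)
    order = torch.argsort(mask.to(torch.int8), stable=True)  # text tokens first
    grouped = flat[order]                                    # gather
    n_text = int((~mask).sum())
    out = torch.cat([grouped[:n_text] @ w_text,              # expert 0 matmul
                     grouped[n_text:] @ w_visual])           # expert 1 matmul
    inverse = torch.empty_like(order)
    inverse[order] = torch.arange(order.numel(), device=order.device)
    return out[inverse].reshape(*x.shape[:-1], -1)           # scatter back
```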
With these designs, Mono-InternVL-1.5 significantly reduces training and
inference costs while still maintaining performance competitive with
Mono-InternVL. To evaluate
our approach, we conduct extensive experiments across 15 benchmarks. Results
demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out
of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared
to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves
similar multimodal performance while reducing first-token latency by up to 69%.
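Here, first-token latency means the wall-clock time from receiving the
multimodal prompt to emitting the first generated token, dominated by the
prefill pass over image and text tokens. A rough way to measure it (our own
harness, not the paper's evaluation protocol) is sketched below.

```python
import time
import torch

@torch.no_grad()
def time_to_first_token(generate_one_token, n_warmup=3, n_runs=10):
    """Average prefill latency. `generate_one_token` is any zero-argument
    callable that runs the model on a fixed prompt with max_new_tokens=1
    (an illustrative interface, not the repository's API)."""
    for _ in range(n_warmup):                 # warm up kernels and caches
        generate_one_token()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_one_token()
        if torch.cuda.is_available():
            torch.cuda.synchronize()          # measure GPU work, not launch time
    return (time.perf_counter() - start) / n_runs
```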
Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.