Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
July 16, 2025
Authors: Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai
cs.AI
Abstract
This paper focuses on monolithic Multimodal Large Language Models (MLLMs),
which integrate visual encoding and language decoding into a single model.
Existing structures and pre-training strategies for monolithic MLLMs often
suffer from unstable optimization and catastrophic forgetting. To address these
challenges, our key idea is to embed a new visual parameter space into a
pre-trained LLM, enabling stable learning of visual knowledge from noisy data
via delta tuning. Based on this principle, we first introduce Mono-InternVL, an
advanced monolithic MLLM that incorporates a set of visual experts through a
multimodal mixture-of-experts architecture. In addition, we design an
innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize
its visual capabilities via progressive learning. Mono-InternVL achieves
performance competitive with existing MLLMs, but it also incurs relatively
high data costs. Therefore, we further present Mono-InternVL-1.5, a cheaper
and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++
introduces additional visual attention experts to Mono-InternVL-1.5 and
re-organizes the pre-training process in an efficient manner. For inference,
Mono-InternVL-1.5 further includes a fused CUDA kernel that speeds up its
mixture-of-experts (MoE) operations (a minimal sketch of the underlying
modality routing follows the abstract). With these
designs, Mono-InternVL-1.5 significantly reduces training and inference costs,
while still maintaining competitive performance with Mono-InternVL. To evaluate
our approach, we conduct extensive experiments across 15 benchmarks. Results
demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out
of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared
to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves
similar multimodal performance while reducing first-token latency by up to 69%.
Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.
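To make the architectural idea concrete, here is a minimal PyTorch sketch of the modality-routed expert layer the abstract describes: a new visual parameter space grafted onto a frozen pre-trained LLM, with tokens hard-routed by modality. The module names, layer sizes, GELU FFN shape, and copy-initialization of the visual expert are illustrative assumptions, not the released Mono-InternVL implementation.

```python
import torch
import torch.nn as nn


class ModalityRoutedFFN(nn.Module):
    """Two-expert FFN with hard modality routing: text tokens pass through
    the pre-trained FFN (frozen, acting as the text expert), while visual
    tokens pass through a newly added copy (the visual expert). Only the
    visual expert is trained, so visual knowledge is learned without
    overwriting the LLM's language parameters (delta tuning)."""

    def __init__(self, hidden_size: int = 2048, intermediate_size: int = 8192):
        super().__init__()
        self.text_expert = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size),
        )
        self.visual_expert = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size),
        )
        # Copying the pre-trained weights into the new visual expert is one
        # plausible initialization; the abstract only says new visual
        # parameters are embedded into the LLM.
        self.visual_expert.load_state_dict(self.text_expert.state_dict())
        for p in self.text_expert.parameters():
            p.requires_grad = False  # pre-trained language weights stay fixed

    def forward(self, x: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); is_visual: (batch, seq) bool mask marking
        # image tokens. Routing is deterministic, so no gating network is needed.
        out = torch.empty_like(x)
        out[~is_visual] = self.text_expert(x[~is_visual])
        out[is_visual] = self.visual_expert(x[is_visual])
        return out


# Toy usage: a sequence of 3 image tokens followed by 3 text tokens.
layer = ModalityRoutedFFN()
x = torch.randn(1, 6, 2048)
mask = torch.tensor([[True, True, True, False, False, False]])
y = layer(x, mask)  # (1, 6, 2048)
```

Executed naively, each layer runs two masked matmul branches as above; the fused CUDA kernel mentioned in the abstract presumably performs this expert dispatch in a single launch, which would account for part of the reported reduction in inference latency.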