Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

Abstract

This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing architectures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts (MoE) architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but incurs a relatively high data cost. We therefore further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and reorganizes the pre-training process in an efficient manner. For inference, Mono-InternVL-1.5 includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs while maintaining performance competitive with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., a 114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.
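As a rough illustration of the two ideas the abstract combines (modality-specific experts and delta tuning), the toy sketch below routes each token to an expert chosen by its modality and applies gradient updates only to the visual expert, leaving the text expert frozen. It is not the authors' implementation: the class `ModalityRoutedFFN`, the single linear map standing in for each expert, the squared-error loss, and all sizes and learning rates are assumptions made purely for illustration.

```python
import random


def make_linear(dim_in, dim_out, rng):
    """Return a dim_out x dim_in weight matrix with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)] for _ in range(dim_out)]


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class ModalityRoutedFFN:
    """Hypothetical toy layer: one frozen text expert, one trainable visual expert."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # Assumption: each expert is a single linear map instead of a full FFN.
        self.experts = {
            "text": make_linear(dim, dim, rng),
            "visual": make_linear(dim, dim, rng),
        }
        # Delta tuning in this sketch: only the newly added visual parameters train.
        self.trainable = {"text": False, "visual": True}

    def forward(self, tokens, modalities):
        # Static routing: each token is processed by the expert of its modality.
        return [matvec(self.experts[m], x) for x, m in zip(tokens, modalities)]

    def sgd_step(self, tokens, modalities, targets, lr=0.1):
        """One gradient step on summed squared error; frozen experts are skipped."""
        outputs = self.forward(tokens, modalities)
        for x, m, y, t in zip(tokens, modalities, outputs, targets):
            if not self.trainable[m]:
                continue  # the text expert stands in for the frozen pre-trained LLM
            w = self.experts[m]
            for i in range(len(w)):
                err = y[i] - t[i]
                for j in range(len(x)):
                    w[i][j] -= lr * 2.0 * err * x[j]


layer = ModalityRoutedFFN(dim=2)
text_before = [row[:] for row in layer.experts["text"]]
visual_before = [row[:] for row in layer.experts["visual"]]

tokens = [[1.0, 0.0], [0.0, 1.0]]   # one text token, one visual token
modalities = ["text", "visual"]
targets = [[0.0, 0.0], [1.0, 1.0]]
layer.sgd_step(tokens, modalities, targets)
```

After the step, `layer.experts["text"]` is unchanged while `layer.experts["visual"]` has moved toward its target. This mirrors, in a very simplified form, how training only the added visual parameters can avoid overwriting the language knowledge held in the pre-trained weights.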
