MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Abstract

Multimodal Large Language Models (MLLMs) are progressing rapidly and represent the frontier of AI development. However, their training and inference efficiency has emerged as a core bottleneck in making MLLMs more accessible and scalable. To address this challenge, we present MiniCPM-V 4.5, an 8B-parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy, and training method: a unified 3D-Resampler model architecture for highly compact encoding of images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results on the OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, this strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B in size, while using just 46.7% of the GPU memory and 8.7% of the inference time of Qwen2.5-VL 7B.
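The efficiency figures above are stated as fractions of the Qwen2.5-VL 7B baseline, which can be easier to read as reduction factors. The following is a minimal sketch of that arithmetic only. The helper name `reduction_factor` and the constant names are illustrative assumptions, not part of any MiniCPM-V code, and the two ratios are taken directly from the abstract.

```python
# Ratios reported in the abstract: MiniCPM-V 4.5 relative to Qwen2.5-VL 7B on VideoMME.
MEMORY_RATIO = 0.467  # 46.7% of the baseline GPU memory cost
TIME_RATIO = 0.087    # 8.7% of the baseline inference time


def reduction_factor(ratio: float) -> float:
    """Convert 'uses `ratio` of the baseline' into an 'N times lower' factor.

    Hypothetical helper for illustration; it simply returns 1 / ratio.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in the interval (0, 1]")
    return 1.0 / ratio


memory_factor = reduction_factor(MEMORY_RATIO)
time_factor = reduction_factor(TIME_RATIO)

print(f"GPU memory cost: about {memory_factor:.1f}x lower than the baseline")
print(f"Inference time: about {time_factor:.1f}x lower than the baseline")
```

Read this way, the reported ratios correspond to roughly 2.1x lower GPU memory cost and roughly 11.5x lower inference time than the baseline on this benchmark.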

