
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

September 16, 2025
Authors: Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address these challenges, we present MiniCPM-V 4.5, an 8B-parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy, and training method: a unified 3D-Resampler model architecture for highly compact encoding of images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results on the OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, as well as significantly larger open-source models such as Qwen2.5-VL 72B. Notably, this strong performance is achieved with remarkable efficiency: on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B parameters while using just 46.7% of the GPU memory and 8.7% of the inference time of Qwen2.5-VL 7B.
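As a rough illustration of the compact-encoding idea behind a 3D resampler, the sketch below shows a perceiver-style module in which a small set of learnable queries cross-attends to the flattened spatio-temporal (T×H×W) patch grid, compressing an arbitrary number of vision tokens down to a fixed token budget. The class name, dimensions, and query count here are illustrative assumptions; the abstract does not specify the internals of MiniCPM-V 4.5's actual 3D-Resampler.

```python
# Minimal sketch of a perceiver-style 3D resampler (hypothetical; not the
# paper's actual implementation). Learnable queries cross-attend to the
# flattened spatio-temporal patch features, so T*H*W vision tokens are
# compressed to num_queries tokens regardless of video length.
import torch
import torch.nn as nn

class Resampler3D(nn.Module):
    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        # Fixed set of learnable query vectors (the compressed token budget).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats):
        # patch_feats: (batch, T*H*W, dim) -- flattened spatio-temporal tokens.
        b = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, patch_feats, patch_feats)
        return self.norm(out)  # (batch, num_queries, dim)

# Example: 6 video frames of 16x16 patches -> 64 tokens (24x compression).
feats = torch.randn(1, 6 * 16 * 16, 1024)
compressed = Resampler3D()(feats)
print(compressed.shape)  # torch.Size([1, 64, 1024])
```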