MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
September 16, 2025
Authors: Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are undergoing rapid progress and
represent the frontier of AI development. However, their training and inference
efficiency have emerged as a core bottleneck in making MLLMs more accessible
and scalable. To address these challenges, we present MiniCPM-V 4.5, an
8B-parameter model designed for high efficiency and strong performance. We
introduce three core improvements in model architecture, data strategy, and
training method: a unified 3D-Resampler model architecture for highly compact
encoding of images and videos, a unified learning paradigm for document
knowledge and text recognition without heavy data engineering, and a hybrid
reinforcement learning strategy for proficiency in both short and long
reasoning modes. Comprehensive experimental results on the OpenCompass evaluation
show that MiniCPM-V 4.5 surpasses widely used proprietary models such as
GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL
72B. Notably, this strong performance is achieved with remarkable efficiency.
For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves
state-of-the-art performance among models under 30B parameters, using just
46.7% of the GPU memory and 8.7% of the inference time of Qwen2.5-VL 7B.
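The abstract describes the unified 3D-Resampler only at a high level. As a rough illustration of the general idea, here is a minimal perceiver-style sketch, assuming (this is one reading, not the paper's specification) that a fixed set of learnable queries cross-attends over flattened spatiotemporal patch tokens, so any image or video clip is compressed to a constant number of visual tokens before reaching the LLM. All names and hyperparameters (Resampler3D, num_queries=64, dim=1024) are illustrative, not the released model's.

```python
# Minimal sketch of a perceiver-style 3D resampler (illustrative, not the
# paper's implementation): learnable queries cross-attend over flattened
# time x height x width patch tokens, yielding a fixed-size token set.
import torch
import torch.nn as nn


class Resampler3D(nn.Module):
    def __init__(self, num_queries=64, dim=1024, num_heads=8):
        super().__init__()
        # Learnable query tokens: the fixed-size output "slots".
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, T*H*W, dim) -- flattened spatiotemporal
        # features from a vision encoder, with 3D position info already added.
        b = patch_tokens.size(0)
        q = self.norm_q(self.queries).unsqueeze(0).expand(b, -1, -1)
        kv = self.norm_kv(patch_tokens)
        # Cross-attention: every query attends over all video patches,
        # so the output length is num_queries regardless of clip length.
        out, _ = self.attn(q, kv, kv)
        return out  # (batch, num_queries, dim)


# Usage: 6 frames of 16x16 patches compress to 64 tokens.
tokens = torch.randn(2, 6 * 16 * 16, 1024)
compact = Resampler3D()(tokens)
print(compact.shape)  # torch.Size([2, 64, 1024])
```

The property this sketch captures is that the output length is fixed by num_queries rather than by clip length, which is what makes the encoding compact for both single images and multi-frame video.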
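The hybrid reinforcement learning strategy is likewise only named in the abstract. One plausible reading is that rollouts are sampled in either a short-answer or a long chain-of-thought mode, with both modes feeding a single policy update; the toy loop below sketches that assumption. policy, reward_fn, and their methods (generate, update) are hypothetical placeholders, not an API from the paper.

```python
# Toy sketch of a hybrid short/long reasoning RL step, under the assumption
# (not confirmed by the abstract) that each rollout is drawn in one of two
# reasoning modes and both contribute to one policy update.
import random


def hybrid_rl_step(policy, prompts, reward_fn, p_long=0.5):
    batch = []
    for prompt in prompts:
        # Randomly pick a reasoning mode for this rollout.
        mode = "long" if random.random() < p_long else "short"
        response = policy.generate(prompt, mode=mode)  # hypothetical API
        # A single reward function scores both modes, so the policy learns
        # to answer accurately with or without a long reasoning trace.
        batch.append((prompt, response, reward_fn(prompt, response)))
    policy.update(batch)  # e.g., a policy-gradient step over both modes
```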