MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
September 16, 2025
Authors: Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are undergoing rapid progress and
represent the frontier of AI development. However, their training and inference
efficiency have emerged as a core bottleneck in making MLLMs more accessible
and scalable. To address these challenges, we present MiniCPM-V 4.5, an
8B-parameter model designed for high efficiency and strong performance. We
introduce three core improvements in model architecture, data strategy, and
training method: a unified 3D-Resampler model architecture for highly compact
encoding of images and videos, a unified learning paradigm for document
knowledge and text recognition without heavy data engineering, and a hybrid
reinforcement learning strategy for proficiency in both short and long
reasoning modes. Comprehensive experimental results on the OpenCompass evaluation
show that MiniCPM-V 4.5 surpasses widely used proprietary models such as
GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL
72B. Notably, this strong performance is achieved with remarkable efficiency.
For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves
state-of-the-art performance among models under 30B parameters, using just
46.7% of the GPU memory and 8.7% of the inference time of Qwen2.5-VL 7B.
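The abstract describes the unified 3D-Resampler only at a high level. As a rough illustration of the general idea, here is a minimal perceiver-style sketch, assuming (this is one reading, not the paper's specification) that a fixed set of learnable queries cross-attends over flattened spatiotemporal patch tokens, so any image or video clip is compressed to a constant number of visual tokens before reaching the LLM. All names and hyperparameters (Resampler3D, num_queries=64, dim=1024) are illustrative, not the released model's.

```python
# Minimal sketch of a perceiver-style 3D resampler (illustrative, not the
# paper's implementation): learnable queries cross-attend over flattened
# time x height x width patch tokens, yielding a fixed-size token set.
import torch
import torch.nn as nn


class Resampler3D(nn.Module):
    def __init__(self, num_queries=64, dim=1024, num_heads=8):
        super().__init__()
        # Learnable query tokens: the fixed-size output "slots".
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, T*H*W, dim) -- flattened spatiotemporal
        # features from a vision encoder, with 3D position info already added.
        b = patch_tokens.size(0)
        q = self.norm_q(self.queries).unsqueeze(0).expand(b, -1, -1)
        kv = self.norm_kv(patch_tokens)
        # Cross-attention: every query attends over all video patches,
        # so the output length is num_queries regardless of clip length.
        out, _ = self.attn(q, kv, kv)
        return out  # (batch, num_queries, dim)


# Usage: 6 frames of 16x16 patches compress to 64 tokens.
tokens = torch.randn(2, 6 * 16 * 16, 1024)
compact = Resampler3D()(tokens)
print(compact.shape)  # torch.Size([2, 64, 1024])
```

The property this sketch captures is that the output length is fixed by num_queries rather than by clip length, which is what makes the encoding compact for both single images and multi-frame video.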
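The hybrid reinforcement learning strategy is likewise only named in the abstract. One plausible reading is that rollouts are sampled in either a short-answer or a long chain-of-thought mode, with both modes feeding a single policy update; the toy loop below sketches that assumption. policy, reward_fn, and their methods (generate, update) are hypothetical placeholders, not an API from the paper.

```python
# Toy sketch of a hybrid short/long reasoning RL step, under the assumption
# (not confirmed by the abstract) that each rollout is drawn in one of two
# reasoning modes and both contribute to one policy update.
import random


def hybrid_rl_step(policy, prompts, reward_fn, p_long=0.5):
    batch = []
    for prompt in prompts:
        # Randomly pick a reasoning mode for this rollout.
        mode = "long" if random.random() < p_long else "short"
        response = policy.generate(prompt, mode=mode)  # hypothetical API
        # A single reward function scores both modes, so the policy learns
        # to answer accurately with or without a long reasoning trace.
        batch.append((prompt, response, reward_fn(prompt, response)))
    policy.update(batch)  # e.g., a policy-gradient step over both modes
```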