MiniCPM-V: A GPT-4V Level MLLM on Your Phone

August 3, 2024
Authors: Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain that prevent MLLMs from being practical in real-world applications. The most notable challenge is the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs must be deployed on high-performance cloud servers, which greatly limits their applicability in mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining, and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) strong performance, outperforming GPT-4V-1106, Gemini Pro, and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks; (2) strong OCR capability and 1.8M-pixel high-resolution image perception at any aspect ratio; (3) trustworthy behavior with low hallucination rates; (4) multilingual support for 30+ languages; and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: the model sizes needed to achieve usable (e.g., GPT-4V-level) performance are rapidly decreasing, while end-side computation capacity grows quickly. Together, these trends show that GPT-4V-level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.
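To make feature (2) concrete, the sketch below shows one way an image of arbitrary aspect ratio could be mapped to a grid of encoder-sized slices under a roughly 1.8M-pixel budget. This is a hypothetical Python illustration only: the 448-pixel slice size, the 1344x1344 budget, and the choose_grid helper are assumptions made for this example, not the paper's actual adaptive visual encoding procedure.

from typing import Tuple
import math

# Hypothetical illustration (not the paper's exact method): choose a slice grid for a
# high-resolution image so that the total pixel budget stays near 1.8M and the grid's
# implied aspect ratio stays close to the original image's.
MAX_PIXELS = 1344 * 1344   # ~1.8M pixels, matching the budget quoted in the abstract
SLICE_SIZE = 448           # assumed per-slice resolution fed to the vision encoder

def choose_grid(width: int, height: int) -> Tuple[int, int]:
    """Return (cols, rows) minimizing aspect-ratio distortion under the pixel budget."""
    budget = min(width * height, MAX_PIXELS)
    target_slices = max(1, math.ceil(budget / (SLICE_SIZE ** 2)))
    best, best_err = (1, 1), float("inf")
    for cols in range(1, target_slices + 1):
        rows = max(1, round(target_slices / cols))
        # Log-ratio distance between the image's aspect ratio and the grid's.
        err = abs(math.log((width / height) / (cols / rows)))
        if err < best_err:
            best, best_err = (cols, rows), err
    return best

# Example: a 3840x2160 (16:9) frame maps to a 4x2 grid, close to its aspect ratio.
print(choose_grid(3840, 2160))   # -> (4, 2)

In the full model, each slice is additionally encoded and compressed into a compact set of visual tokens before reaching the language model (see the paper for details); the sketch above only captures the grid-selection intuition.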
