MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Abstract

The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges still prevent MLLMs from being practical in real-world applications. The most notable one is the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performance cloud servers, which rules out many scenarios, such as mobile, offline, energy-sensitive, and privacy-protective settings. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining, and alignment, the latest model, MiniCPM-Llama3-V 2.5, has several notable features: (1) strong performance, outperforming GPT-4V-1106, Gemini Pro, and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong OCR capability and high-resolution image perception of up to 1.8M pixels at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: the model size needed to reach usable (e.g., GPT-4V-level) performance is rapidly decreasing, while end-side computation capacity is growing fast. Together, these trends show that deploying GPT-4V-level MLLMs on end devices is becoming increasingly feasible, unlocking a wider spectrum of real-world AI applications in the near future.
