MiniCPM-V: A GPT-4V Level MLLM on Your Phone

August 3, 2024
Authors: Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges still prevent MLLMs from being practical in real-world applications. The most notable is the high cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs must be deployed on high-performance cloud servers, which greatly limits their use in mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining, and alignment, the latest model, MiniCPM-Llama3-V 2.5, has several notable features: (1) strong performance, outperforming GPT-4V-1106, Gemini Pro, and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks; (2) strong OCR capability and 1.8M-pixel high-resolution image perception at any aspect ratio; (3) trustworthy behavior with low hallucination rates; (4) multilingual support for 30+ languages; and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: the model sizes needed to achieve usable (e.g., GPT-4V) level performance are rapidly decreasing, while end-side computation capacity is growing fast. Together, these trends show that deploying GPT-4V level MLLMs on end devices is becoming increasingly feasible, unlocking a wider spectrum of real-world AI applications in the near future.
