MiniCPM-V: A GPT-4V Level MLLM on Your Phone

August 3, 2024
Authors: Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain that prevent MLLMs from being practical in real-world applications. The most notable challenge is the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs must be deployed on high-performance cloud servers, which greatly limits their applicability in mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining, and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) strong performance, outperforming GPT-4V-1106, Gemini Pro, and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks; (2) strong OCR capability and 1.8M-pixel high-resolution image perception at any aspect ratio; (3) trustworthy behavior with low hallucination rates; (4) multilingual support for 30+ languages; and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: the model sizes needed to achieve usable (e.g., GPT-4V-level) performance are rapidly decreasing, while end-side computation capacity grows quickly. Together, these trends show that GPT-4V-level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.
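To make feature (2) concrete, the sketch below shows one way an image of arbitrary aspect ratio could be mapped to a grid of encoder-sized slices under a roughly 1.8M-pixel budget. This is a hypothetical Python illustration only: the 448-pixel slice size, the 1344x1344 budget, and the choose_grid helper are assumptions made for this example, not the paper's actual adaptive visual encoding procedure.

from typing import Tuple
import math

# Hypothetical illustration (not the paper's exact method): choose a slice grid for a
# high-resolution image so that the total pixel budget stays near 1.8M and the grid's
# implied aspect ratio stays close to the original image's.
MAX_PIXELS = 1344 * 1344   # ~1.8M pixels, matching the budget quoted in the abstract
SLICE_SIZE = 448           # assumed per-slice resolution fed to the vision encoder

def choose_grid(width: int, height: int) -> Tuple[int, int]:
    """Return (cols, rows) minimizing aspect-ratio distortion under the pixel budget."""
    budget = min(width * height, MAX_PIXELS)
    target_slices = max(1, math.ceil(budget / (SLICE_SIZE ** 2)))
    best, best_err = (1, 1), float("inf")
    for cols in range(1, target_slices + 1):
        rows = max(1, round(target_slices / cols))
        # Log-ratio distance between the image's aspect ratio and the grid's.
        err = abs(math.log((width / height) / (cols / rows)))
        if err < best_err:
            best, best_err = (cols, rows), err
    return best

# Example: a 3840x2160 (16:9) frame maps to a 4x2 grid, close to its aspect ratio.
print(choose_grid(3840, 2160))   # -> (4, 2)

In the full model, each slice is additionally encoded and compressed into a compact set of visual tokens before reaching the language model (see the paper for details); the sketch above only captures the grid-selection intuition.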
