MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Abstract

The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges still prevent MLLMs from being practical in real-world applications. The most notable challenge is the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performance cloud servers, which greatly limits their use in mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining, and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) strong performance, outperforming GPT-4V-1106, Gemini Pro, and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks; (2) strong OCR capability and 1.8M-pixel high-resolution image perception at any aspect ratio; (3) trustworthy behavior with low hallucination rates; (4) multilingual support for 30+ languages; and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: the model sizes needed to achieve usable (e.g., GPT-4V-level) performance are rapidly decreasing, while end-side computation capacity is growing fast. Together, these trends show that GPT-4V-level MLLMs deployed on end devices are becoming increasingly feasible, unlocking a wider spectrum of real-world AI applications in the near future.
