MiniCPM4: Ultra-Efficient LLMs on End Devices

Abstract

This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. For model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both the prefilling and decoding phases of long-context processing. For training data, we propose UltraClean, an efficient and accurate strategy for filtering and generating pre-training data, and UltraChat v2, a comprehensive supervised fine-tuning dataset. With these datasets, the model reaches satisfactory performance using just 8 trillion training tokens. For training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and we improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and BitCPM, a data-efficient ternary LLM. For inference systems, we propose CPM.cu, which integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters. Extensive evaluation shows that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, demonstrating both its efficiency and effectiveness. Notably, MiniCPM4-8B is significantly faster than Qwen3-8B when processing long sequences. With further adaptation, MiniCPM4 powers diverse applications, including trustworthy survey generation and tool use with the Model Context Protocol, showcasing its broad usability.
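
As a rough illustration of what a ternary LLM means in practice, the sketch below maps a list of weights to the three values -1, 0, and +1 with a single shared scale. The abstract does not specify how BitCPM quantizes its weights, so the absmean scaling used here is an assumption borrowed from common ternarization practice, and the function names `ternarize` and `dequantize` are hypothetical.

```python
from typing import List, Tuple


def ternarize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Map each weight to -1, 0, or +1 with one shared scale (absmean scaling)."""
    # Assumed scheme (not taken from the paper): scale = mean absolute weight.
    scale = sum(abs(w) for w in weights) / len(weights)
    scale = max(scale, eps)  # avoid division by zero for all-zero weights
    # Round w / scale to the nearest integer, then clip into {-1, 0, +1}.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate weights from ternary codes and the shared scale."""
    return [c * scale for c in codes]


weights = [0.8, -0.05, 0.4, -1.2, 0.0, 0.3]
codes, scale = ternarize(weights)
approx = dequantize(codes, scale)
print(codes)   # ternary codes, one per weight
print(approx)  # coarse reconstruction of the original weights
```

Each weight then needs fewer than two bits of storage plus one shared floating-point scale, which is the kind of memory saving that makes ternary models attractive on end-side devices.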
