MiniCPM4: Ultra-Efficient LLMs on End Devices

Abstract

This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. For model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both the prefilling and decoding phases of long-context processing. For training data, we propose UltraClean, an efficient and accurate strategy for filtering and generating pre-training data, and UltraChat v2, a comprehensive supervised fine-tuning dataset. With these datasets, satisfactory model performance is reached using only 8 trillion training tokens. For training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and we improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and BitCPM, a data-efficient ternary LLM. For inference systems, we propose CPM.cu, which integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters respectively. Extensive evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and its effectiveness. Notably, MiniCPM4-8B is significantly faster than Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 powers diverse applications, including trustworthy survey generation and tool use with the Model Context Protocol, demonstrating its broad usability.
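To make the term "ternary LLM" concrete, the following is a minimal sketch of ternary weight quantization. It is not BitCPM's method, which the abstract does not describe; it assumes the common absmean rule, in which each weight is divided by the mean absolute value of the group, rounded, and clamped to {-1, 0, +1}. The function names `ternarize` and `dequantize` are hypothetical and chosen only for illustration.

```python
def ternarize(weights, eps=1e-8):
    """Map floats to codes in {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling, i.e. scale = mean(|w|). This is an
    illustrative choice, not the scheme confirmed for BitCPM.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    # Shared scale: mean absolute value, floored at eps to avoid division by zero.
    scale = max(sum(abs(w) for w in weights) / len(weights), eps)
    # Round each scaled weight to the nearest integer, then clamp to [-1, 1].
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate float weights from ternary codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    codes, scale = ternarize([0.9, -0.05, -1.2, 0.4])
    print(codes)                     # [1, 0, -1, 1]
    print(dequantize(codes, scale))  # approximate reconstruction
```

Each weight then needs under two bits of storage plus one shared scale per group, which is why ternary models appeal for on-device deployment. The cost is reconstruction error, which quantization-aware training is typically used to absorb.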
