MiniCPM4: Ultra-Efficient LLMs on End Devices

Abstract

This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. For model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both the prefilling and decoding phases of long-context processing. For training data, we propose UltraClean, an efficient and accurate strategy for filtering and generating pre-training data, and UltraChat v2, a comprehensive supervised fine-tuning dataset. Together, these datasets allow the model to reach satisfactory performance with only 8 trillion training tokens. For training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and we improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and BitCPM, a data-efficient ternary LLM. For inference systems, we propose CPM.cu, which integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Extensive evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, demonstrating both its efficiency and its effectiveness. Notably, MiniCPM4-8B is significantly faster than Qwen3-8B when processing long sequences. With further adaptation, MiniCPM4 powers diverse applications, including trustworthy survey generation and tool use with the Model Context Protocol, demonstrating its broad usability.
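To make the term "ternary LLM" concrete, the following minimal sketch shows one common way to restrict weights to the three values -1, 0, and +1 with a single shared scale. The abstract does not describe how BitCPM quantizes its weights, so the mean-absolute-value scaling used here is an assumption chosen for illustration, and the function names `ternarize` and `dequantize` are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Map a list of floats to codes in {-1, 0, +1} plus one shared scale.

    Assumption: the scale is the mean absolute weight. This is an
    illustrative choice, not necessarily the scheme BitCPM uses.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    scale = max(scale, eps)  # guard against division by zero for all-zero input
    # Round each scaled weight to the nearest integer, then clip to [-1, 1].
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and the shared scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    codes, scale = ternarize([0.9, -0.05, -1.2, 0.4])
    print(codes, scale)
    print(dequantize(codes, scale))
```

Each weight then needs fewer than two bits of storage plus a share of one floating-point scale, which is the kind of memory saving that makes ternary models attractive on end devices.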
