MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Abstract

Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency has emerged as a core bottleneck in making MLLMs more accessible and scalable. To address this challenge, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy, and training method: a unified 3D-Resampler model architecture for highly compact encoding of images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results on the OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, this strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B parameters, using just 46.7% of the GPU memory and 8.7% of the inference time of Qwen2.5-VL 7B.
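To make the reported efficiency figures concrete, the short sketch below converts the two percentages stated for VideoMME (46.7% GPU memory and 8.7% inference time, relative to Qwen2.5-VL 7B) into reduction factors. The function name `relative_savings` is hypothetical, and the calculation is plain arithmetic on the stated numbers, not a new measurement.

```python
def relative_savings(fraction_of_baseline: float) -> dict:
    """Convert 'uses X of the baseline cost' into a reduction factor and percent saved.

    fraction_of_baseline: cost relative to the baseline, e.g. 0.467 for 46.7%.
    """
    if not 0.0 < fraction_of_baseline <= 1.0:
        raise ValueError("fraction_of_baseline must be in (0, 1]")
    return {
        # How many times smaller the cost is than the baseline's.
        "reduction_factor": 1.0 / fraction_of_baseline,
        # Share of the baseline cost that is avoided, in percent.
        "percent_saved": (1.0 - fraction_of_baseline) * 100.0,
    }


# Figures taken from the abstract (baseline: Qwen2.5-VL 7B on VideoMME).
memory = relative_savings(0.467)
inference_time = relative_savings(0.087)

print(f"GPU memory: {memory['reduction_factor']:.2f}x lower, "
      f"{memory['percent_saved']:.1f}% saved")
print(f"Inference time: {inference_time['reduction_factor']:.2f}x lower, "
      f"{inference_time['percent_saved']:.1f}% saved")
```

Read this way, the stated figures correspond to roughly 2.1 times less GPU memory and about 11.5 times less inference time than the baseline on that benchmark.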
