Mellum 2 Technical Report

Abstract

We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance. It is the successor to Mellum, a completion-focused 4B dense model. The architecture is a Mixture-of-Experts design (64 experts, 8 active) that combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that serves both as an auxiliary pre-training objective and as a built-in draft model for speculative decoding. Each choice was validated by ablation, with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens and follows a three-phase curriculum that progressively shifts the data mixture from diverse web data toward curated code and mathematical content. The model is optimized with Muon under FP8 hybrid precision, using a Warmup-Hold-Decay schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via layer-selective YaRN and then post-trained in two stages, supervised fine-tuning followed by reinforcement learning with verifiable rewards (RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B to 14B range while running at the per-token compute of a 2.5B dense model. We release the base, Instruct, and Thinking checkpoints under the Apache 2.0 license, together with this report on the architecture decisions, data pipeline, and training recipe behind them.
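To make two of the choices named above concrete, the following is a minimal Python sketch of a Warmup-Hold-Decay schedule with linear decay to zero and of an attention layout with Sliding Window Attention on three of every four layers. It is an illustration under stated assumptions, not the training code behind the model. The abstract does not give the warmup length, the decay length, the peak learning rate, the number of layers, or which layer in each group of four uses full attention, so the function names, parameters, and the choice to place full attention last in each group are all hypothetical.

```python
def whd_learning_rate(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Warmup-Hold-Decay: linear warmup, constant hold, linear decay to zero.

    All arguments are illustrative; the report's actual values are not
    stated in the abstract.
    """
    if warmup_steps < 0 or decay_steps <= 0:
        raise ValueError("warmup_steps must be >= 0 and decay_steps > 0")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay must fit inside total_steps")
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")

    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Hold: stay at the peak.
        return peak_lr
    # Decay: ramp linearly from the peak down to exactly 0 at total_steps.
    return peak_lr * (total_steps - step) / decay_steps


def attention_pattern(num_layers, period=4):
    """Label each layer as sliding-window or full attention.

    Assumption: the last layer of every group of `period` layers uses full
    attention and the others use a sliding window. The abstract only says
    "three of every four layers", not where the full-attention layer sits.
    """
    return [
        "full" if (index + 1) % period == 0 else "sliding"
        for index in range(num_layers)
    ]


if __name__ == "__main__":
    # Hypothetical run: 1000 steps, 100 warmup steps, 200 decay steps.
    for step in (0, 50, 100, 800, 900, 1000):
        print(step, whd_learning_rate(step, 1000, 1.0, 100, 200))
    print(attention_pattern(8))
```

In this sketch the hold phase takes whatever remains after warmup and decay, so the three phases always sum to the full run. How the phases are actually sized in training is a detail for the body of the report.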