Mellum 2 Technical Report

Abstract

We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering: code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance. It succeeds the completion-focused 4B dense Mellum model. The architecture is a Mixture-of-Experts design (64 experts, 8 active per token) that combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that serves both as an auxiliary pre-training objective and as a built-in draft model for speculative decoding. Each choice was validated by ablation, with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens in a three-phase curriculum that progressively shifts the data mixture from diverse web data toward curated code and mathematical content. Training uses the Muon optimizer with FP8 mixed precision and a Warmup-Hold-Decay learning-rate schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via layer-selective YaRN and then post-trained in two stages, supervised fine-tuning followed by reinforcement learning with verifiable rewards (RLVR). This yields two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B to 14B parameter range while running at the per-token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints under the Apache 2.0 license, together with this report on the architecture decisions, data pipeline, and training recipe behind them.
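To make two of the design choices above concrete, the following is a minimal sketch, not the authors' implementation. It illustrates a Warmup-Hold-Decay schedule with linear decay to zero and a layer layout in which three of every four layers use sliding-window attention. The function names, the step counts, the linear shape of the warmup, and the choice of which layer in each group of four uses full attention are all assumptions made for illustration; the abstract specifies only the linear decay to zero and the 3:1 ratio.

```python
def whd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Warmup-Hold-Decay learning rate at a given step (hypothetical helper).

    Assumed shape: linear warmup from 0 to peak_lr, constant hold at peak_lr,
    then linear decay that reaches exactly 0 at total_steps.
    """
    if decay_steps <= 0 or warmup_steps < 0:
        raise ValueError("decay_steps must be positive and warmup_steps non-negative")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases exceed total_steps")
    if not 0 <= step <= total_steps:
        raise ValueError("step out of range")

    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup phase (assumed linear).
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Hold phase: stay at the peak learning rate.
        return peak_lr
    # Decay phase: linear decay to zero, as stated in the abstract.
    return peak_lr * (total_steps - step) / decay_steps


def attention_pattern(num_layers, period=4):
    """Per-layer attention type with (period - 1) sliding-window layers per group.

    Assumption: the last layer of each group of `period` uses full attention.
    The abstract gives only the ratio, not the position.
    """
    return ["full" if (i + 1) % period == 0 else "sliding" for i in range(num_layers)]


if __name__ == "__main__":
    # Illustrative values only; none of these numbers come from the report.
    for s in (0, 50, 100, 500, 900, 1000):
        print(s, whd_lr(s, total_steps=1000, peak_lr=1.0, warmup_steps=100, decay_steps=200))
    print(attention_pattern(8))
```

With `period=4`, each group of four layers contains three sliding-window layers and one full-attention layer, matching the ratio described above. The actual layer ordering and schedule hyperparameters should be taken from the released configuration rather than from this sketch.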