Mellum 2 Technical Report

Abstract

We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance. It succeeds Mellum, a completion-focused 4B dense model. The architecture is a Mixture-of-Experts design (64 experts, 8 active) that combines Grouped-Query Attention (4 KV heads), Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that serves as both an auxiliary pre-training objective and a built-in draft model for speculative decoding. Each choice was validated by ablation, with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens in a three-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content. Optimization uses Muon under FP8 hybrid precision with a Warmup-Hold-Decay schedule whose decay phase is linear to zero. The pre-trained base is extended to a 128K context window via layer-selective YaRN and then post-trained in two stages, supervised fine-tuning followed by reinforcement learning with verifiable rewards (RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B to 14B range while running at the per-token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints under the Apache 2.0 license, together with this report on the architecture decisions, data pipeline, and training recipe behind them.
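
As an illustration of two ingredients named in the abstract, the following minimal Python sketch shows one way a Warmup-Hold-Decay learning-rate schedule with linear decay to zero, and a layer layout with Sliding Window Attention on three of every four layers, could be expressed. It is not the released training code. The function names, the step counts, the peak learning rate, and the choice of which layer in each group of four uses full attention are all assumptions made for illustration; the abstract does not specify them.

```python
def whd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Warmup-Hold-Decay schedule: linear warmup, constant hold, linear decay to zero.

    All arguments are hypothetical illustration values, not the report's settings.
    """
    if decay_steps <= 0 or warmup_steps < 0:
        raise ValueError("decay_steps must be positive and warmup_steps non-negative")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases must fit within total_steps")
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")

    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Hold: keep the learning rate at its peak.
        return peak_lr
    # Decay: ramp linearly from peak_lr down to exactly 0 at total_steps.
    return peak_lr * (total_steps - step) / decay_steps


def attention_pattern(num_layers, period=4):
    """Label each layer as sliding-window or full attention.

    Assumption: the last layer of every group of `period` layers uses full
    attention and the others use a sliding window. The abstract only states
    the three-in-four ratio, not the position of the full-attention layer.
    """
    return [
        "full" if (i + 1) % period == 0 else "sliding"
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    # Hypothetical run: 1000 steps, 100 warmup steps, 200 decay steps.
    for s in (0, 50, 500, 900, 1000):
        print(s, whd_lr(s, total_steps=1000, peak_lr=1.0,
                        warmup_steps=100, decay_steps=200))
    print(attention_pattern(8))
```

Under this assumed layout, three quarters of the layers attend only within a local window, which is where a design of this kind would save KV-cache memory at long context, while the remaining full-attention layers retain access to the whole sequence.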