Mellum2 Technical Report
May 29, 2026
Authors: Marko Kojic, Ivan Bondyrev, Aral de Moor, Joseph Shtok, Petr Borovlev, Kseniia Lysaniuk, Madeeswaran Kannan, Ivan Dolgov, Nikita Pavlichenko
cs.AI
Abstract
We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance. It is the successor to Mellum, a completion-focused 4B dense model. The architecture is a Mixture-of-Experts design (64 experts, 8 active) that combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that serves as both an auxiliary pre-training objective and a built-in draft model for speculative decoding. Each choice was validated by ablation, with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens in a three-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content, optimized with Muon under FP8 hybrid precision and a Warmup-Hold-Decay schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via layer-selective YaRN and then post-trained in two stages (supervised fine-tuning followed by RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B-14B range while running at the per-token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints under the Apache 2.0 license, together with this report on the architecture decisions, data pipeline, and training recipe behind them.
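
The abstract names a Warmup-Hold-Decay schedule with linear decay to zero but does not state the peak learning rate or the length of each phase. The sketch below shows the general shape of such a schedule, assuming a linear warmup. The function name `whd_lr` and all of its parameters are hypothetical and chosen for illustration only; they are not taken from the report.

```python
def whd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int, decay_steps: int) -> float:
    """Learning rate at `step` under a Warmup-Hold-Decay schedule.

    Phases (all lengths are illustrative assumptions):
      1. warmup: linear ramp from 0 to peak_lr over `warmup_steps`
      2. hold:   constant peak_lr
      3. decay:  linear ramp from peak_lr down to 0 over the final `decay_steps`
    """
    if warmup_steps < 0 or decay_steps <= 0:
        raise ValueError("warmup_steps must be >= 0 and decay_steps must be > 0")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases must fit inside total_steps")
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")

    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: fraction of the ramp completed so far.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Hold: stay at the peak value.
        return peak_lr
    # Decay: remaining fraction of the decay phase, reaching 0 at total_steps.
    return peak_lr * (total_steps - step) / decay_steps
```

With `total_steps=1000`, `warmup_steps=100`, and `decay_steps=200`, the rate climbs for the first 100 steps, stays at `peak_lr` until step 800, and then falls linearly to exactly zero at step 1000.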