Mellum 2 Technical Report

Abstract

We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering: code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance. It is the successor to the completion-focused 4B dense Mellum model. The architecture is a Mixture-of-Experts design (64 experts, 8 active) that combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that serves as both an auxiliary pre-training objective and a built-in draft model for speculative decoding. Each choice was validated by ablation, with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens in a three-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content; it is optimized with Muon under FP8 hybrid precision and a Warmup-Hold-Decay schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via layer-selective YaRN and then post-trained in two stages (supervised fine-tuning followed by RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B to 14B range while running at the per-token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints under the Apache 2.0 license, together with this report on the architecture decisions, data pipeline, and training recipe behind them.
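
As a rough illustration of the Warmup-Hold-Decay schedule with linear decay to zero mentioned above, the sketch below computes a learning rate for a given step. It is a minimal sketch under stated assumptions, not the training code behind Mellum 2: the function name, the linear shape of the warmup, and every numeric value (peak rate, step counts) are hypothetical, since the abstract specifies only the overall shape and the linear decay to zero.

```python
def whd_learning_rate(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Warmup-Hold-Decay: ramp up, hold at peak_lr, then decay linearly to zero.

    Assumption: the warmup is linear. The source states only that the
    decay phase is linear and ends at zero.
    """
    if warmup_steps <= 0 or decay_steps <= 0:
        raise ValueError("warmup_steps and decay_steps must be positive")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases must fit within total_steps")
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")

    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup phase: rise from 0 toward peak_lr.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Hold phase: stay at the peak learning rate.
        return peak_lr
    # Decay phase: fall linearly, reaching exactly 0 at total_steps.
    return peak_lr * (total_steps - step) / decay_steps


if __name__ == "__main__":
    # Hypothetical values chosen only to show the three phases.
    for s in (0, 50, 100, 500, 800, 900, 1000):
        print(s, whd_learning_rate(s, total_steps=1000, peak_lr=1.0,
                                   warmup_steps=100, decay_steps=200))
```

Keeping the schedule as a pure function of the step makes it easy to plug into a training loop or a scheduler wrapper, and the hold phase means the decay can be started from any intermediate checkpoint without retuning the warmup.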