Motif 2 12.7B Technical Report
November 7, 2025
Authors: Junghwan Lim, Sungmin Lee, Dongseok Kim, Taehyun Kim, Eunhwan Park, Jeesoo Lee, Jeongdoo Lee, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Minjae Kim, Taewhan Kim, Youngrok Kim, Hyukjin Kweon, Haesol Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Dongjoo Weon
cs.AI
Abstract
We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
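To make the Grouped Differential Attention (GDA) idea more concrete, below is a minimal PyTorch sketch of a differential-attention-style layer with an unbalanced split between signal heads and noise-control heads. It is not the report's implementation: the head-sharing scheme, the fixed λ parameterization, the signal:noise ratio, and the omission of causal masking and normalization are all assumptions made for illustration.

```python
# Hedged sketch of a GDA-style layer: each signal head subtracts a shared,
# group-level "noise-control" attention map from its own attention map.
# All structural details here are assumptions, not the report's design.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedDifferentialAttention(nn.Module):
    def __init__(self, d_model: int, n_signal_heads: int = 8,
                 n_noise_heads: int = 4, lambda_init: float = 0.5):
        super().__init__()
        assert n_signal_heads % n_noise_heads == 0
        self.hd = d_model // n_signal_heads            # per-head dimension
        self.ns, self.nn_ = n_signal_heads, n_noise_heads
        self.group = n_signal_heads // n_noise_heads   # signal heads per noise head
        # Signal pathway: one query/key pair per signal head.
        self.q1 = nn.Linear(d_model, n_signal_heads * self.hd, bias=False)
        self.k1 = nn.Linear(d_model, n_signal_heads * self.hd, bias=False)
        # Noise-control pathway: fewer heads, shared within each group.
        self.q2 = nn.Linear(d_model, n_noise_heads * self.hd, bias=False)
        self.k2 = nn.Linear(d_model, n_noise_heads * self.hd, bias=False)
        self.v = nn.Linear(d_model, n_signal_heads * self.hd, bias=False)
        self.out = nn.Linear(n_signal_heads * self.hd, d_model, bias=False)
        # Learnable subtraction weight (simplified; the actual
        # reparameterization of lambda is not specified here).
        self.lam = nn.Parameter(torch.tensor(lambda_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape

        def split(t: torch.Tensor, h: int) -> torch.Tensor:
            # (B, T, h * hd) -> (B, h, T, hd)
            return t.view(B, T, h, self.hd).transpose(1, 2)

        q1, k1 = split(self.q1(x), self.ns), split(self.k1(x), self.ns)
        q2, k2 = split(self.q2(x), self.nn_), split(self.k2(x), self.nn_)
        v = split(self.v(x), self.ns)
        scale = 1.0 / math.sqrt(self.hd)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)  # signal maps
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)  # noise-control maps
        a2 = a2.repeat_interleave(self.group, dim=1)               # share within each group
        attn = a1 - self.lam * a2                                  # differential attention map
        y = (attn @ v).transpose(1, 2).reshape(B, T, self.ns * self.hd)
        return self.out(y)
```

The grouping mirrors the asymmetry described in the abstract: most capacity goes to the signal pathway, while a smaller set of noise-control heads is reused across groups, which is what yields the representational-efficiency argument for GDA.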