Motif 2 12.7B technical report
November 7, 2025
Authors: Junghwan Lim, Sungmin Lee, Dongseok Kim, Taehyun Kim, Eunhwan Park, Jeesoo Lee, Jeongdoo Lee, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Minjae Kim, Taewhan Kim, Youngrok Kim, Hyukjin Kweon, Haesol Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Dongjoo Weon
cs.AI
Abstract
We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
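As a rough illustration of the GDA idea mentioned above (a signal attention pathway paired with a separate noise-control pathway), the PyTorch sketch below implements a grouped variant of differential attention. The head counts, the GQA-style sharing of noise-control heads across groups of signal heads, the single learnable scalar λ, and the omission of a causal mask are all assumptions made for this example and are not taken from the report.

```python
import torch
import torch.nn.functional as F
from torch import nn


class GroupedDifferentialAttention(nn.Module):
    """Minimal sketch of a grouped differential attention layer.

    Assumed (not from the report): an unbalanced head split into a larger
    "signal" group and a smaller "noise-control" group, each noise-control
    head shared by several signal heads, and one scalar lambda per layer.
    """

    def __init__(self, dim: int, signal_heads: int = 12, noise_heads: int = 4):
        super().__init__()
        assert signal_heads % noise_heads == 0
        self.head_dim = dim // signal_heads
        self.signal_heads = signal_heads
        self.noise_heads = noise_heads
        self.group_size = signal_heads // noise_heads

        # Signal pathway: full set of attention heads.
        self.q_s = nn.Linear(dim, signal_heads * self.head_dim, bias=False)
        self.k_s = nn.Linear(dim, signal_heads * self.head_dim, bias=False)
        # Noise-control pathway: fewer heads, shared across groups.
        self.q_n = nn.Linear(dim, noise_heads * self.head_dim, bias=False)
        self.k_n = nn.Linear(dim, noise_heads * self.head_dim, bias=False)
        self.v = nn.Linear(dim, signal_heads * self.head_dim, bias=False)
        self.out = nn.Linear(signal_heads * self.head_dim, dim, bias=False)
        # Differential weight; the parameterization here is an assumption.
        self.lam = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape

        def split(t: torch.Tensor, heads: int) -> torch.Tensor:
            # (B, T, heads * head_dim) -> (B, heads, T, head_dim)
            return t.view(B, T, heads, self.head_dim).transpose(1, 2)

        q_s, k_s = split(self.q_s(x), self.signal_heads), split(self.k_s(x), self.signal_heads)
        q_n, k_n = split(self.q_n(x), self.noise_heads), split(self.k_n(x), self.noise_heads)
        v = split(self.v(x), self.signal_heads)

        scale = self.head_dim ** -0.5
        attn_signal = F.softmax(q_s @ k_s.transpose(-2, -1) * scale, dim=-1)
        attn_noise = F.softmax(q_n @ k_n.transpose(-2, -1) * scale, dim=-1)
        # Repeat each noise-control map so it pairs with every signal head in its group.
        attn_noise = attn_noise.repeat_interleave(self.group_size, dim=1)

        # Differential attention: signal map minus lambda-scaled noise-control map.
        ctx = (attn_signal - self.lam * attn_noise) @ v
        ctx = ctx.transpose(1, 2).reshape(B, T, self.signal_heads * self.head_dim)
        return self.out(ctx)
```

In this sketch the noise-control pathway uses fewer projections than the signal pathway, so subtracting the shared noise-control maps adds relatively little compute and parameter overhead; whether the report allocates heads this way, or uses per-head λ values, is not specified in the abstract.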