Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM
November 23, 2025
Authors: Yang Liu, Xiaolong Zhong, Ling Jiang
cs.AI
Abstract
Large language models deliver strong reasoning and tool-use skills, yet their computational demands make them impractical for edge or cost-sensitive deployments. We present Xmodel-2.5, a 1.3-billion-parameter small language model designed as a drop-in agent core. Training with maximal-update parameterization (μP) allows hyper-parameters tuned on a 20M-parameter proxy to transfer directly to the full model, even under a tied-word-embedding architecture. The model is trained on a 1.4T-token Warmup-Stable-Decay (WSD) curriculum, and we further show that switching from AdamW to Muon during the decay phase improves the 13-task reasoning average by 4.58% while keeping every other hyper-parameter fixed, verifying that early AdamW stability can be paired with late Muon sharpening for better downstream performance. FP8 mixed-precision training balances accuracy and throughput. All checkpoints, training recipes, and evaluation code are released under the Apache-2.0 license: model at https://huggingface.co/XiaoduoAILab/Xmodel-2.5, training checkpoints at https://huggingface.co/XiaoduoAILab/Xmodel-2.5-history, and training code plus the evaluation harness at https://github.com/XiaoduoAILab/Xmodel-2.5.
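To make the optimizer hand-off concrete, here is a minimal sketch, not the authors' released training code, of a Warmup-Stable-Decay learning-rate schedule with a switch from AdamW to a Muon-style optimizer at the start of the decay phase. The `muon` import, the `Muon` constructor signature, the phase lengths, and the peak learning rate are assumptions for illustration; the actual values and implementation are in the linked repository.

```python
# A minimal sketch (assumed values, not the paper's recipe): a Warmup-Stable-Decay
# (WSD) learning-rate schedule and an AdamW -> Muon hand-off at the decay phase.
import torch


def wsd_lr(step: int, peak_lr: float,
           warmup_steps: int, stable_steps: int, decay_steps: int) -> float:
    """Piecewise WSD schedule: linear warmup, constant plateau, linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        return peak_lr
    frac = (step - warmup_steps - stable_steps) / max(1, decay_steps)
    return peak_lr * max(0.0, 1.0 - frac)


def build_optimizers(model: torch.nn.Module, phase: str, peak_lr: float):
    """AdamW during warmup/stable; in decay, hand the 2-D hidden weights to Muon and
    keep AdamW for embeddings, norms, and biases (a common Muon recipe). All other
    hyper-parameters stay fixed across the switch, as in the abstract's ablation."""
    if phase != "decay":
        return [torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=0.1)]
    matrix_params = [p for p in model.parameters() if p.ndim == 2]
    other_params = [p for p in model.parameters() if p.ndim != 2]
    from muon import Muon  # assumed third-party Muon implementation, not part of PyTorch
    return [Muon(matrix_params, lr=peak_lr),  # assumed constructor signature
            torch.optim.AdamW(other_params, lr=peak_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)]
```

In this sketch the schedule value from wsd_lr would be written into each optimizer's parameter groups every step, so the only change at the decay boundary is the update rule itself, mirroring the "every other hyper-parameter fixed" comparison described above.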