Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM

Abstract

Large language models deliver strong reasoning and tool-use skills, yet their computational demands make them impractical for edge or cost-sensitive deployments. We present Xmodel-2.5, a 1.3-billion-parameter small language model designed as a drop-in agent core. Training with maximal-update parameterization (μP) allows hyper-parameters tuned on a 20M-parameter proxy to transfer directly to the full model, even with tied word embeddings (the tie-word-embedding architecture). The model is trained with a 1.4T-token Warmup-Stable-Decay curriculum. We further show that switching from AdamW to Muon during the decay phase, with every other hyper-parameter held fixed, improves the 13-task reasoning average by 4.58%, indicating that early AdamW stability can be paired with late Muon sharpening for better downstream performance. FP8 mixed-precision training balances accuracy and throughput. All checkpoints, recipes, and evaluation code are released under the Apache-2.0 license.

Models: https://huggingface.co/XiaoduoAILab/Xmodel-2.5 and https://huggingface.co/XiaoduoAILab/Xmodel-2.5-history (training checkpoints). Training code and evaluation harness: https://github.com/XiaoduoAILab/Xmodel-2.5.
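As a rough illustration of the Warmup-Stable-Decay curriculum with an optimizer switch in the decay phase, the sketch below computes a learning rate per step and reports which optimizer would be active. The warmup fraction, decay fraction, linear decay shape, and minimum learning-rate ratio are illustrative assumptions and are not taken from the paper; the function names are hypothetical, and the optimizer is returned only as a label rather than through any real Muon or AdamW API.

```python
def wsd_lr(step, total_steps, peak_lr,
           warmup_frac=0.01, decay_frac=0.1, min_lr_ratio=0.1):
    """Learning rate at `step` under a Warmup-Stable-Decay schedule.

    All fractions and the linear decay shape are assumptions for illustration.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: fall linearly from the peak to min_lr_ratio * peak_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    min_lr = peak_lr * min_lr_ratio
    return min_lr + (peak_lr - min_lr) * (1.0 - progress)


def optimizer_for_step(step, total_steps, decay_frac=0.1):
    """Label of the optimizer active at `step`: AdamW until decay starts, then Muon."""
    decay_steps = max(1, int(total_steps * decay_frac))
    return "muon" if step >= total_steps - decay_steps else "adamw"


if __name__ == "__main__":
    total = 1000
    for s in (0, 9, 500, 899, 900, 950, 999):
        print(s, optimizer_for_step(s, total), wsd_lr(s, total, peak_lr=1e-3))
```

In a real training loop, the switch would also require constructing the new optimizer and deciding how to handle its state at the phase boundary; the sketch leaves that out.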
