**ArXiv-to-Model: A Practical Study of Scientific LM Training**
February 19, 2026
Author: Anuj Gupta
cs.AI
Abstract
While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors. We further analyze convergence dynamics and show stable training behavior in a data-rich regime (52B pretraining tokens). Rather than proposing a novel architecture, this work provides an engineering-grounded, transparent account of training a small scientific language model from scratch. We hope these insights support researchers operating under moderate compute budgets who seek to build domain-specialized models.
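To make the pipeline stages named above concrete, the sketch below illustrates archive validation, LaTeX extraction, and text normalization in Python. The function names and heuristics are illustrative assumptions for exposition only, not the authors' actual implementation.

```python
# Minimal sketch of the preprocessing stages summarized in the abstract.
# Function names and heuristics are illustrative assumptions, not the
# authors' actual pipeline.
import re
import tarfile
from pathlib import Path


def validate_archive(path: Path) -> bool:
    """Archive validation: accept only readable .tar.gz bundles containing a .tex file."""
    try:
        with tarfile.open(path, "r:gz") as tar:
            return any(m.name.endswith(".tex") for m in tar.getmembers())
    except (tarfile.TarError, OSError):
        return False


def extract_latex_text(tex_source: str) -> str:
    """LaTeX extraction: keep the document body and drop comments."""
    body = tex_source.split(r"\begin{document}")[-1]
    body = body.split(r"\end{document}")[0]
    return re.sub(r"(?<!\\)%.*", "", body)  # strip unescaped LaTeX comments


def normalize(text: str) -> str:
    """Text normalization: repair common ligatures and collapse whitespace."""
    text = text.replace("\ufb01", "fi").replace("\ufb02", "fl")
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = r"\documentclass{article}\begin{document}Hello % comment" + "\n" + r"world\end{document}"
    print(normalize(extract_latex_text(sample)))  # -> "Hello world"
```

In a full pipeline, the normalized text would then be passed to a domain-aware tokenizer and packed into fixed-length training sequences; those steps are omitted here because they depend on the specific tokenizer and training framework chosen.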