

ArXiv-to-Model: A Practical Study of Scientific LM Training

February 19, 2026
Author: Anuj Gupta
cs.AI

Abstract

While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors. We further analyze convergence dynamics and show stable training behavior in a data-rich regime (52B pretraining tokens). Rather than proposing a novel architecture, this work provides an engineering-grounded, transparent account of training a small scientific language model from scratch. We hope these insights support researchers operating under moderate compute budgets who seek to build domain-specialized models.
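To make the pipeline concrete, the sketch below shows what the LaTeX-extraction and text-normalization stages might look like. It is a minimal illustration only: the regular expressions, the choice to keep inline math, and the sample input are assumptions for exposition, not the paper's actual implementation.

# Minimal sketch (not the authors' released code) of the LaTeX-extraction and
# text-normalization stages described in the abstract. Regex choices, the
# decision to keep inline math, and the sample input are illustrative assumptions.
import re
import unicodedata

COMMENT_RE = re.compile(r"(?<!\\)%.*")  # LaTeX comments to end of line
FLOAT_RE = re.compile(r"\\begin\{(figure|table)\*?\}.*?\\end\{\1\*?\}", re.S)  # drop floats
DROP_RE = re.compile(r"\\(?:label|cite[pt]?|ref|eqref)\{[^{}]*\}")  # drop refs/citations
UNWRAP_RE = re.compile(r"\\(?:textbf|textit|emph|section\*?|subsection\*?)\{([^{}]*)\}")

def latex_to_text(tex: str) -> str:
    """Turn raw LaTeX source into normalized plain text, keeping inline math."""
    tex = COMMENT_RE.sub("", tex)
    tex = FLOAT_RE.sub(" ", tex)
    tex = DROP_RE.sub("", tex)
    tex = UNWRAP_RE.sub(r"\1", tex)              # keep the argument, drop the command
    tex = unicodedata.normalize("NFKC", tex)     # canonicalize Unicode forms
    tex = re.sub(r"[ \t]+", " ", tex)            # collapse horizontal whitespace
    tex = re.sub(r"\n{3,}", "\n\n", tex)         # collapse blank-line runs
    return tex.strip()

if __name__ == "__main__":
    sample = r"""\begin{figure}\includegraphics{plot.pdf}\end{figure}
We study the \textbf{scaling} of loss $L(N)$ % inline comment
with parameter count $N$ \cite{example2020}."""
    print(latex_to_text(sample))

In a real arXiv pipeline this step would sit after archive validation and before tokenization; the abstract's point that such preprocessing decisions strongly affect usable token volume follows from how aggressively stages like these discard or rewrite source text.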