daVinci-LLM: Towards the Science of Pretraining

Abstract

The foundational pretraining phase determines a model's capability ceiling, because post-training struggles to overcome the capability foundations established during pretraining. Yet pretraining remains critically under-explored. This stems from a structural paradox: organizations with computational resources operate under commercial pressures that inhibit transparent disclosure, while academic institutions possess research freedom but lack pretraining-scale computational resources. daVinci-LLM occupies this unexplored intersection, combining industrial-scale resources with full research freedom to advance the science of pretraining. We adopt a fully open paradigm that treats openness as scientific methodology, releasing complete data processing pipelines, full training processes, and systematic exploration results. Recognizing that the field lacks a systematic methodology for data processing, we employ the Data Darwinism framework, a principled L0-L9 taxonomy spanning filtering to synthesis. We train a 3B-parameter model from random initialization on 8T tokens using a two-stage adaptive curriculum that progressively shifts from foundational capabilities to reasoning-intensive enhancement. Through 200+ controlled ablations, we establish that: (1) processing depth systematically enhances capabilities, making it a critical dimension alongside volume scaling; (2) different domains exhibit distinct saturation dynamics, necessitating adaptive strategies that range from proportion adjustments to format shifts; (3) compositional balance enables targeted intensification while preventing performance collapse; and (4) evaluation protocol choices shape our understanding of pretraining progress. By releasing the complete exploration process, we enable the community to build on our findings and systematic methodologies and to accumulate scientific knowledge about pretraining.
