Gemstones: A Model Suite for Multi-Faceted Scaling Laws
February 7, 2025
Authors: Sean McLeish, John Kirchenbauer, David Yu Miller, Siddharth Singh, Abhinav Bhatele, Micah Goldblum, Ashwinee Panda, Tom Goldstein
cs.AI
Abstract
Scaling laws are typically fit using a family of models with a narrow range
of frozen hyper-parameter choices. In this work we study scaling laws using a
wide range of architecture and hyper-parameter choices, and highlight their
impact on resulting prescriptions. As a primary artifact of our research, we
release the Gemstones: the most comprehensive open-source scaling law dataset
to date, consisting of over 4000 checkpoints from transformers with up to 2
billion parameters; these models have been trained with different learning
rates, cooldown schedules, and architectural shapes. Our checkpoints enable
more complex studies of scaling, such as a law that predicts language modeling
performance as a function of model width and depth. By examining the various
facets of our model suite, we find that the prescriptions of scaling laws can
be highly sensitive to the experimental design process and the specific model
checkpoints used during fitting. Code:
https://github.com/mcleish7/gemstone-scaling-laws
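As a rough illustration of what fitting a width/depth scaling law involves, below is a minimal sketch in Python. The parametric form (an irreducible loss plus separate power-law terms in width and depth), the parameter names, and the toy data are assumptions chosen for illustration; they are not the paper's actual fitted law or its checkpoint data.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_model(X, E, A, B, alpha, beta):
    # Hypothetical law: irreducible loss E plus power-law terms in
    # model width and depth (the paper's exact form may differ).
    width, depth = X
    return E + A / width**alpha + B / depth**beta

# Toy grid of (width, depth) shapes with noiseless losses generated from
# known parameters, standing in for losses from real checkpoints.
W, D = np.meshgrid([256.0, 512.0, 1024.0, 2048.0], [4.0, 8.0, 16.0])
width, depth = W.ravel(), D.ravel()
loss = loss_model((width, depth), E=2.0, A=40.0, B=1.5, alpha=0.6, beta=0.4)

# Recover the parameters by nonlinear least squares.
popt, _ = curve_fit(loss_model, (width, depth), loss,
                    p0=[1.0, 10.0, 1.0, 0.5, 0.5], maxfev=20000)
print(dict(zip(["E", "A", "B", "alpha", "beta"], popt)))
```

In a real fit, the toy grid would be replaced by (width, depth, loss) records from the released checkpoints, and the abstract's finding suggests the resulting prescriptions should be checked for sensitivity to which checkpoints are included.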