
Prescriptive Scaling Laws for Data Constrained Training

May 2, 2026
Authors: Justin Lovelace, Christian Belardi, Srivatsa Kundurthy, Shriya Sudhakar, Kilian Q. Weinberger
cs.AI

Abstract

Training compute is increasingly outpacing the availability of high-quality data, shifting the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique, which limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice: beyond a certain point, further repetition is counterproductive, and compute is better spent on model capacity. We show that following our law's recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay (λ=1.0) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that the optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.
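
The abstract describes the loss model only at a high level, so the following is a minimal, hypothetical sketch of what fitting a one-coefficient additive overfitting penalty on top of a Chinchilla-style loss L(N, D) = E + A/N^α + B/D^β might look like. The Chinchilla constants below are the published estimates from Hoffmann et al. (2022), but the log(R) penalty shape, the synthetic data, and all names (repeated_data_loss, c_ovf, R) are illustrative assumptions, not the paper's fitted form.

```python
import numpy as np
from scipy.optimize import curve_fit

def repeated_data_loss(x, E, A, alpha, B, beta, c_ovf):
    """Chinchilla-style loss plus a one-coefficient overfitting penalty.

    N: parameter count; D_unique: unique tokens; R: repetition factor,
    so the total tokens seen in training is D_total = D_unique * R.
    The log(R) dependence of the penalty is an illustrative assumption.
    """
    N, D_unique, R = x
    D_total = D_unique * R
    base = E + A / N**alpha + B / D_total**beta   # Chinchilla form
    return base + c_ovf * np.log(R)               # additive penalty

# Synthetic "measured" losses, only to show the fitting interface.
rng = np.random.default_rng(0)
N = rng.uniform(1e8, 1e10, 50)
D_unique = rng.uniform(1e9, 1e11, 50)
R = rng.uniform(1.0, 16.0, 50)
# E, A, alpha, B, beta from Hoffmann et al. (2022); c_ovf is made up.
true_params = (1.69, 406.4, 0.34, 410.7, 0.28, 0.02)
y = repeated_data_loss((N, D_unique, R), *true_params)
y += rng.normal(0.0, 1e-3, 50)

popt, _ = curve_fit(
    repeated_data_loss, (N, D_unique, R), y,
    p0=(2.0, 300.0, 0.3, 300.0, 0.3, 0.01), maxfev=50000)
print(f"fitted overfitting coefficient c_ovf = {popt[-1]:.4f}")
```

Because overfitting is isolated in the single coefficient c_ovf, two training configurations (e.g., weight decay λ=0.1 vs. λ=1.0) can be compared by fitting each set of runs separately and taking the ratio of their coefficients; the abstract's ~70% reduction under strong weight decay can be read as exactly such a comparison.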