

Test-Time Scaling Makes Overtraining Compute-Optimal

April 1, 2026
作者: Nicholas Roberts, Sungjun Cho, Zhiqi Gao, Tzu-Heng Huang, Albert Wu, Gabriel Orlanski, Avi Trost, Kelly Buchanan, Aws Albarghouthi, Frederic Sala
cs.AI

Abstract

Modern LLMs scale at test time, e.g., via repeated sampling, where inference cost grows with both model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test (T^2) scaling laws that jointly optimize model size, training tokens, and the number of inference samples under a fixed end-to-end budget. T^2 modernizes pretraining scaling laws with the pass@k modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from T^2 are robust across two distinct modeling approaches: measuring the joint scaling effect on task loss, and modeling the impact on task accuracy. Across eight downstream tasks, we find that when inference cost is accounted for, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that T^2 scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, since frontier LLMs are post-trained, we show that our findings survive the post-training stage, making T^2 scaling meaningful in modern deployments.
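The pass@k metric that the abstract builds on is conventionally estimated with the unbiased estimator from the Codex evaluation literature: generate n samples per problem, count the c correct ones, and compute the probability that at least one of k samples drawn without replacement is correct. A minimal sketch of that estimator (the paper's own T^2 fitting procedure is not reproduced here):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem
    c: number of those samples that are correct
    k: samples the user would draw at test time (k <= n)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw
        # of k samples must include at least one correct one.
        return 1.0
    # 1 - P(all k drawn samples are incorrect), drawing
    # without replacement from the n generated samples.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, with 4 samples of which 2 are correct, `pass_at_k(4, 2, 1)` gives 0.5, and larger k drives the estimate toward 1.0 whenever c > 0, which is the test-time scaling effect the trade-off in the abstract is about.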