Test-Time Scaling Makes Overtraining Compute-Optimal

April 1, 2026
作者: Nicholas Roberts, Sungjun Cho, Zhiqi Gao, Tzu-Heng Huang, Albert Wu, Gabriel Orlanski, Avi Trost, Kelly Buchanan, Aws Albarghouthi, Frederic Sala
cs.AI

Abstract

Modern LLMs scale at test time, e.g., via repeated sampling, so inference cost grows with both model size and the number of samples. This creates a trade-off that pretraining scaling laws such as Chinchilla do not address. We present Train-to-Test (T^2) scaling laws that jointly optimize model size, training tokens, and the number of inference samples under a fixed end-to-end budget. T^2 modernizes pretraining scaling laws with the pass@k modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from T^2 are robust across two distinct modeling approaches: measuring the effect of joint scaling on task loss, and modeling its impact directly on task accuracy. Across eight downstream tasks, we find that when inference cost is accounted for, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that T^2 scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, since frontier LLMs are post-trained, we show that our findings survive the post-training stage, making T^2 scaling meaningful in modern deployments.
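To make the setup concrete, here is a minimal sketch of the kind of joint optimization the abstract describes. It is not the paper's method or fitted values: the loss uses the standard Chinchilla parametric form, the exponential loss-to-pass@1 link (with hypothetical constant gamma) is an illustrative assumption, pass@k assumes independent samples, and the 6ND pretraining and 2N-FLOPs-per-generated-token inference costs are the usual heuristics. All grid values and the deployment workload (queries, output tokens) are made up for illustration.

```python
# Hedged sketch of a T^2-style joint search over model size N, training
# tokens D, and test-time samples k under one end-to-end FLOP budget.
# Every constant below is an illustrative assumption, not a fitted value.

import itertools
import math

# Hypothetical Chinchilla-style loss parameters (not the paper's fits).
E, A, ALPHA, B, BETA = 1.7, 400.0, 0.34, 4e3, 0.28

def task_loss(n_params: float, n_tokens: float) -> float:
    """Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def pass_at_k(loss: float, k: int, gamma: float = 2.0) -> float:
    """Map loss to pass@1 via an assumed exponential link, then apply the
    independent-samples pass@k formula 1 - (1 - p)^k."""
    p1 = math.exp(-gamma * (loss - E))  # hypothetical link function
    return 1.0 - (1.0 - p1) ** k

def end_to_end_flops(n_params, n_tokens, k, queries=1e6, out_tokens=512):
    """Approximate total cost: ~6ND for pretraining, plus ~2N FLOPs per
    generated token over `queries` prompts with k samples each."""
    return 6 * n_params * n_tokens + 2 * n_params * k * queries * out_tokens

def best_config(budget: float):
    """Grid-search (N, D, k); keep the feasible config with the best pass@k."""
    sizes = [10 ** e for e in range(8, 12)]   # 1e8 .. 1e11 params
    ratios = [5, 20, 100, 500, 2000]          # tokens per param (overtraining ->)
    ks = [1, 4, 16, 64, 256]
    best = None
    for n, r, k in itertools.product(sizes, ratios, ks):
        d = n * r
        if end_to_end_flops(n, d, k) > budget:
            continue  # violates the end-to-end budget
        score = pass_at_k(task_loss(n, d), k)
        if best is None or score > best[0]:
            best = (score, n, d, k)
    return best

if __name__ == "__main__":
    score, n, d, k = best_config(budget=1e24)
    print(f"pass@k={score:.3f} at N={n:.1e} params, D={d:.1e} tokens, k={k}")
```

Note that the token-to-parameter grid deliberately extends far beyond the roughly 20:1 Chinchilla-optimal ratio; once the budget charges for inference samples as well as pretraining, the optimizer can prefer a smaller, heavily overtrained model sampled many times, which is the regime shift the abstract reports.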