Test-Time Scaling Makes Overtraining Compute-Optimal

Abstract

Modern LLMs scale at test time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test (T^2) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. T^2 modernizes pretraining scaling laws with the pass@k modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from T^2 are robust over two distinct modeling approaches: measuring the joint scaling effect on the task loss, and modeling the impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that T^2 scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, as frontier LLMs are post-trained, we show that our findings survive the post-training stage, making T^2 scaling meaningful in modern deployments.
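
To make the trade-off concrete, the following is a minimal sketch of the kind of end-to-end accounting the abstract describes. It is not the T^2 formulation itself. The cost rules (roughly 6·N·D FLOPs for training and 2·N FLOPs per generated token for inference) are common approximations assumed here, the pass@k expression assumes independent samples with a fixed per-sample success rate, and every function and parameter name is hypothetical.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples succeeds,
    given a per-sample success rate p (an idealized assumption)."""
    return 1.0 - (1.0 - p) ** k


def train_flops(n_params: float, n_tokens: float) -> float:
    """Assumed training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def inference_flops(n_params: float, k: int,
                    tokens_per_sample: float, n_queries: float) -> float:
    """Assumed inference cost: about 2 FLOPs per parameter per generated
    token, multiplied by k samples for each of n_queries queries."""
    return 2.0 * n_params * k * tokens_per_sample * n_queries


def max_samples_within_budget(budget: float, n_params: float, n_tokens: float,
                              tokens_per_sample: float, n_queries: float) -> int:
    """Largest k whose training plus inference cost fits in the budget."""
    remaining = budget - train_flops(n_params, n_tokens)
    if remaining <= 0:
        return 0
    cost_per_sample = inference_flops(n_params, 1, tokens_per_sample, n_queries)
    return int(remaining // cost_per_sample)


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    budget, tokens, sample_len, queries = 1e20, 1e10, 1000, 1e6
    for n_params in (1e9, 5e8):
        k = max_samples_within_budget(budget, n_params, tokens, sample_len, queries)
        print(f"N={n_params:.0e}: up to k={k} samples per query")
```

Under these assumed cost rules, halving the model size at a fixed token count frees budget for more samples per query, which is the tension between pretraining and test-time decisions that a joint scaling law has to resolve.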

