Predicting Downstream LLM Performance with Proxy Metrics

Abstract

Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next-token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) for cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman ρ = 0.81 (vs. ρ = 0.36 for cross-entropy loss); 2) for pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly 10,000× less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an 18× compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
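
To make the token-level statistics concrete, the following is a minimal sketch of how they might be computed for one expert-written solution, and how a candidate proxy could be scored by rank correlation. It is an illustration under stated assumptions, not the method evaluated in this work: the function names `token_level_stats` and `spearman_rho`, the plain averaging over positions, and the tie handling are all hypothetical choices, and the abstract does not specify how the statistics are aggregated into a proxy.

```python
import math
from typing import Dict, List, Sequence


def token_level_stats(
    logits: Sequence[Sequence[float]],
    expert_tokens: Sequence[int],
    k: int = 5,
) -> Dict[str, float]:
    """Average token-level statistics along one expert-written solution.

    logits[t] holds a model's unnormalized next-token scores at position t;
    expert_tokens[t] is the token id the expert actually wrote there.
    Simple averaging over positions is an assumption of this sketch.
    """
    if not logits or len(logits) != len(expert_tokens):
        raise ValueError("need one non-empty logit row per expert token")

    entropy_sum = nll_sum = rank_sum = 0.0
    topk_hits = 0
    for row, target in zip(logits, expert_tokens):
        # Numerically stable log-softmax over the vocabulary.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        log_probs = [x - log_z for x in row]

        # Shannon entropy (in nats) of the predictive distribution.
        entropy_sum -= sum(math.exp(lp) * lp for lp in log_probs)
        # Negative log-likelihood of the expert token (cross-entropy term).
        nll_sum -= log_probs[target]
        # Expert token rank: 1 means the expert token is the top prediction.
        # Ties are resolved optimistically (assumption of this sketch).
        rank = 1 + sum(1 for x in row if x > row[target])
        rank_sum += rank
        topk_hits += rank <= k

    n = len(expert_tokens)
    return {
        "mean_entropy": entropy_sum / n,
        "cross_entropy": nll_sum / n,
        "mean_expert_rank": rank_sum / n,
        "top_k_accuracy": topk_hits / n,
    }


def spearman_rho(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences of length >= 2")

    def average_ranks(values: Sequence[float]) -> List[float]:
        order = sorted(range(len(values)), key=lambda i: values[i])
        ranks = [0.0] * len(values)
        start = 0
        while start < len(order):
            end = start
            # Extend the group while the next value ties with this one.
            while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
                end += 1
            shared = (start + end) / 2 + 1  # average 1-based rank of the tie group
            for idx in order[start:end + 1]:
                ranks[idx] = shared
            start = end + 1
        return ranks

    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

In this sketch, one would compute a statistic such as `mean_expert_rank` for each candidate model on the same set of expert solutions, then call `spearman_rho` on those values and the models' measured downstream accuracies to see how well the statistic orders the models. Note that lower rank and lower entropy are expected to accompany higher accuracy, so such statistics would be negated (or the sign of ρ interpreted accordingly) before comparison.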