Predicting Downstream Performance of Large Language Models Using Proxy Metrics

Abstract

Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next-token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) for cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman ρ = 0.81 (vs. ρ = 0.36 for cross-entropy loss); 2) for pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly 10,000× less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an 18× compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
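To make the token-level statistics named above concrete, the following is a minimal sketch of how entropy, top-k accuracy, and expert token rank could be computed from a model's next-token logits along one expert-written solution. It is an illustration under stated assumptions, not the paper's implementation: the function names `token_stats` and `proxy_metrics`, the plain averaging over positions, and the toy logits are all hypothetical, and the abstract does not specify how the statistics are actually aggregated.

```python
import math
from statistics import mean


def token_stats(logits, expert_token, k=5):
    """Statistics of one next-token distribution relative to the expert's token.

    `logits` is a list of unnormalized scores over the vocabulary and
    `expert_token` is the index of the token the expert actually wrote.
    """
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Shannon entropy (in nats) of the predicted distribution.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)

    # Rank 1 means no other token has a strictly higher logit.
    rank = 1 + sum(1 for x in logits if x > logits[expert_token])

    # Negative log-likelihood of the expert token (the usual cross-entropy term).
    nll = math.log(z) + m - logits[expert_token]

    return {
        "entropy": entropy,
        "rank": rank,
        "topk_hit": 1.0 if rank <= k else 0.0,
        "nll": nll,
    }


def proxy_metrics(logits_per_position, expert_tokens, k=5):
    """Average the per-token statistics over one expert solution.

    Assumption: a simple mean over positions; other aggregations are possible.
    """
    rows = [token_stats(l, t, k) for l, t in zip(logits_per_position, expert_tokens)]
    return {
        "mean_entropy": mean(r["entropy"] for r in rows),
        "topk_accuracy": mean(r["topk_hit"] for r in rows),
        "mean_expert_rank": mean(r["rank"] for r in rows),
        "cross_entropy": mean(r["nll"] for r in rows),
    }


# Toy example: a 4-token vocabulary and a 2-token expert solution.
toy_logits = [
    [2.0, 1.0, 0.0, -1.0],  # expert token 0 is the model's top choice
    [0.0, 3.0, 1.0, 2.0],   # expert token 2 is only the third choice
]
toy_expert_tokens = [0, 2]
example = proxy_metrics(toy_logits, toy_expert_tokens, k=2)
print(example)
```

In this toy case the expert token is ranked first at one position and third at the other, so with k = 2 the top-k accuracy is 0.5 and the mean expert rank is 2. In practice the logits would come from a forward pass of the candidate model over the expert solution, and the resulting per-model scores could then be compared against measured downstream accuracy, for example with a rank correlation.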