Predicting Downstream Performance of Large Language Models Using Proxy Metrics

Abstract

Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose constructing proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next-token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) for cross-family model selection, they rank a heterogeneous population of reasoning models with a mean Spearman's ρ = 0.81 (vs. ρ = 0.36 for cross-entropy loss); 2) for pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly 10,000× less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an 18× compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
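To make the token-level statistics named in the abstract concrete, the following is a minimal sketch of how entropy, top-k accuracy, and expert token rank might be computed from next-token logits along an expert-written solution. It is an illustration under stated assumptions, not the authors' implementation: the function names `token_stats` and `aggregate`, the use of a plain mean, and the log transform of the rank are all hypothetical choices, and the abstract does not specify how the statistics are combined into a final proxy.

```python
import math
from statistics import mean


def token_stats(logits, expert_id, k=5):
    """Statistics of one next-token distribution relative to the expert's token.

    `logits` is a list of unnormalized scores over the vocabulary and
    `expert_id` is the index of the token the expert actually wrote.
    (Hypothetical helper; names and outputs are illustrative assumptions.)
    """
    # Numerically stable softmax over the vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Shannon entropy of the predictive distribution, in nats.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)

    # Rank of the expert token: 1 means it is the model's top choice.
    # Ties are resolved in the expert token's favor.
    rank = 1 + sum(1 for x in logits if x > logits[expert_id])

    return {
        "entropy": entropy,
        "rank": rank,
        "topk_hit": 1.0 if rank <= k else 0.0,
        "nll": -math.log(probs[expert_id]),  # per-token cross-entropy
    }


def aggregate(per_token_logits, expert_ids, k=5):
    """Average token-level statistics over one expert solution.

    Assumption: a simple mean over positions; other aggregations are possible.
    """
    stats = [token_stats(l, t, k) for l, t in zip(per_token_logits, expert_ids)]
    return {
        "mean_entropy": mean(s["entropy"] for s in stats),
        "topk_accuracy": mean(s["topk_hit"] for s in stats),
        "mean_log_rank": mean(math.log(s["rank"]) for s in stats),
        "cross_entropy": mean(s["nll"] for s in stats),
    }


# Toy example: a 3-token vocabulary and a 2-token expert solution.
toy_logits = [[2.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
toy_expert = [1, 2]
toy_proxy = aggregate(toy_logits, toy_expert, k=1)
```

In practice the logits would come from a forward pass of the candidate model over the expert solution with teacher forcing. The resulting aggregates could then be compared against downstream accuracy across models, for example with a rank correlation such as Spearman's ρ.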