Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base Large Language Models from Minimal Data

Abstract

Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that this ability is largely present before any targeted training: when prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle of two phases: a calibration-coupled reinforcement learning phase that improves the answer while training the model to predict the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. Using only 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and remains stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.
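The two phases can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, since the abstract does not give formulas: the attribute names, the score scale, the linear form of the reward, the `weight` parameter, and the helper names `calibration_error`, `coupled_reward`, and `prediction_only_mask` are all hypothetical. It shows one plausible way a reward could couple answer quality with calibration, and one plausible way a loss mask could restrict distillation to the prediction tokens so that the answer is left untouched.

```python
from typing import Dict, List


def calibration_error(predicted: Dict[str, float], actual: Dict[str, float]) -> float:
    """Mean absolute gap between predicted and actual judge scores.

    Averaged over the attributes the judge actually scored (assumed metric).
    """
    gaps = [abs(predicted[name] - score) for name, score in actual.items()]
    return sum(gaps) / len(gaps)


def coupled_reward(
    predicted: Dict[str, float],
    actual: Dict[str, float],
    weight: float = 1.0,
) -> float:
    """Hypothetical calibration-coupled reward.

    Rewards a high mean judge score (answer quality) and penalizes
    miscalibrated self-predictions. The linear form is an assumption.
    """
    quality = sum(actual.values()) / len(actual)
    return quality - weight * calibration_error(predicted, actual)


def prediction_only_mask(num_answer_tokens: int, num_prediction_tokens: int) -> List[int]:
    """Hypothetical loss mask for the distillation phase.

    Answer tokens get 0 (no gradient signal), prediction tokens get 1,
    assuming the prediction is emitted after the answer.
    """
    return [0] * num_answer_tokens + [1] * num_prediction_tokens


if __name__ == "__main__":
    # Illustrative scores only; not taken from the paper.
    predicted = {"helpfulness": 4.0, "accuracy": 3.0}
    actual = {"helpfulness": 5.0, "accuracy": 3.0}
    print(calibration_error(predicted, actual))
    print(coupled_reward(predicted, actual))
    print(prediction_only_mask(3, 2))
```

In this sketch a larger `weight` pushes the model harder toward accurate self-prediction at the expense of raw judge score; how SEE actually balances the two is not specified in the abstract.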