Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base Large Language Models from Very Little Data

Abstract

Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that this ability is largely present before any targeted training: with few-shot prompting, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses at well above chance accuracy across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle of two phases: a calibration-coupled reinforcement learning phase, which improves the answer while training the model to predict the judge, and a masked distillation phase, which sharpens the prediction while leaving the answer untouched. From only 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and remains stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.
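The abstract does not give the reward or the distillation loss, so the following is only a minimal sketch of how the two phases could be expressed, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the attribute names, the 0-to-1 score scale, the use of mean absolute error as the calibration term, and the per-token role labels used to build the mask.

```python
from typing import Dict, List


def calibration_error(predicted: Dict[str, float], judged: Dict[str, float]) -> float:
    """Mean absolute gap between self-predicted and judge-assigned scores.

    Assumption: both dicts map attribute names to scores on the same scale.
    """
    shared = predicted.keys() & judged.keys()
    if not shared:
        raise ValueError("no attributes in common between prediction and judge")
    return sum(abs(predicted[k] - judged[k]) for k in shared) / len(shared)


def coupled_reward(
    predicted: Dict[str, float],
    judged: Dict[str, float],
    weight: float = 1.0,
) -> float:
    """Hypothetical calibration-coupled reward.

    Rewards answer quality (mean judge score) and penalizes a self-prediction
    that misses the judge. `weight` is an illustrative trade-off knob.
    """
    quality = sum(judged.values()) / len(judged)
    return quality - weight * calibration_error(predicted, judged)


def prediction_only_mask(token_roles: List[str]) -> List[int]:
    """Hypothetical loss mask for the masked distillation phase.

    Marks only the self-evaluation tokens (role "prediction") with 1, so a
    loss multiplied by this mask leaves the answer tokens untouched.
    """
    return [1 if role == "prediction" else 0 for role in token_roles]


if __name__ == "__main__":
    predicted = {"helpfulness": 0.8, "accuracy": 0.6}  # illustrative values
    judged = {"helpfulness": 0.6, "accuracy": 0.6}     # illustrative values
    print(coupled_reward(predicted, judged))
    print(prediction_only_mask(["answer", "answer", "prediction"]))
```

In this sketch, coupling the two terms in one scalar means a policy cannot raise its reward by improving the answer alone or by predicting the judge alone, and the mask restricts the second phase's gradient to the prediction tokens, matching the abstract's description of a phase that leaves the answer unchanged.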