Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Abstract

Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that this ability is largely present before any targeted training: given only a few-shot prompt, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle of two phases: a calibration-coupled reinforcement learning phase that improves the answer while predicting the judge's scores, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline uses, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and remains stable across judges the model was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.
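The abstract names the two phases of SEE but does not give their objectives. The following is a minimal sketch of one way such a cycle could be wired, not the authors' formulation. It assumes that the reinforcement learning reward combines normalized judge scores with a penalty on prediction error, and that the distillation loss is averaged only over the tokens carrying the score prediction. The function names, the `weight` and `max_score` parameters, and the 0-10 score scale are all hypothetical.

```python
from typing import List


def calibration_coupled_reward(
    judge_scores: List[float],
    predicted_scores: List[float],
    weight: float = 1.0,      # hypothetical trade-off between quality and calibration
    max_score: float = 10.0,  # hypothetical upper bound of the judge's scale
) -> float:
    """Assumed reward: normalized mean judge score minus weighted mean absolute prediction error."""
    if len(judge_scores) != len(predicted_scores):
        raise ValueError("one prediction is required per judged attribute")
    if not judge_scores:
        raise ValueError("at least one attribute score is required")
    n = len(judge_scores)
    # Answer-quality term: how well the judge rated the answer, scaled to [0, 1].
    quality = sum(judge_scores) / (n * max_score)
    # Calibration term: how far the model's self-predicted scores are from the judge's.
    error = sum(abs(j - p) for j, p in zip(judge_scores, predicted_scores)) / (n * max_score)
    return quality - weight * error


def masked_mean_loss(token_losses: List[float], prediction_mask: List[int]) -> float:
    """Assumed masked objective: average per-token loss over prediction tokens only.

    Answer tokens (mask 0) contribute nothing, so the answer is left untouched.
    """
    if len(token_losses) != len(prediction_mask):
        raise ValueError("mask must align with token losses")
    kept = [loss for loss, keep in zip(token_losses, prediction_mask) if keep]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)
```

Under these assumptions, a perfectly calibrated prediction leaves the reward equal to the normalized judge score, while any prediction error lowers it. In the masked loss, only positions marked 1 in `prediction_mask` influence the result.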