Self-Evaluation Is Already There: Eliciting Latent Judgment Calibration in Base LLMs with Minimal Data

Abstract

Large language models are increasingly evaluated by other models, which raises a natural question: can a model predict how a judge will score its own output? We find that this ability is largely present before any targeted training. Prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle of two phases: a calibration-coupled reinforcement learning phase, in which the model both improves its answer and predicts the judge's scores, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and remains stable across judges the model was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.