Judging with Confidence: Calibrating Autoraters to Preference Distributions

Abstract

The alignment of large language models (LLMs) with human values increasingly relies on using other LLMs as automated judges, or "autoraters". However, their reliability is limited by a foundational issue: they are trained on discrete preference labels, forcing a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: 1) direct supervised fine-tuning for dense, probabilistic labels, and 2) a reinforcement learning approach for sparse, binary labels. Our empirical results show that fine-tuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks.
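
As a rough illustration of the two data conditions named in the abstract, the sketch below shows one plausible objective for each. It is not taken from the paper: the function names, the choice of soft-label cross-entropy for dense probabilistic labels, and the choice of a Brier-style reward for sparse binary labels are all assumptions made here for illustration. Both are standard proper scoring rules, so each is optimized when the predicted probability equals the target preference rate.

```python
import math


def soft_label_cross_entropy(p_pred: float, p_target: float, eps: float = 1e-12) -> float:
    """Cross-entropy between a predicted preference probability and a dense
    (probabilistic) label. Hypothetical stand-in for a distribution-matching
    supervised loss; minimized when p_pred == p_target."""
    p = min(max(p_pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p_target * math.log(p) + (1.0 - p_target) * math.log(1.0 - p))


def brier_reward(p_pred: float, label: int) -> float:
    """Reward for a single sparse binary label (1 = response A preferred).
    Assumed Brier-style reward: 1 minus the squared error."""
    return 1.0 - (p_pred - label) ** 2


def expected_brier_reward(p_pred: float, p_target: float) -> float:
    """Expected reward when binary labels are drawn with P(label = 1) = p_target."""
    return p_target * brier_reward(p_pred, 1) + (1.0 - p_target) * brier_reward(p_pred, 0)


if __name__ == "__main__":
    q = 0.7  # illustrative share of the target population preferring response A
    grid = [i / 100 for i in range(1, 100)]
    best_sft = min(grid, key=lambda p: soft_label_cross_entropy(p, q))
    best_rl = max(grid, key=lambda p: expected_brier_reward(p, q))
    print("best prediction under soft-label loss:", best_sft)
    print("best prediction under expected Brier reward:", best_rl)
```

Under these assumptions, an autorater that always answers 0 or 1 is penalized on a contested comparison, whereas one that reports the population-level preference rate is not. That is the intuition behind matching a distribution rather than a single label.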
