Judging with Confidence: Calibrating Autoraters to Preference Distributions

Abstract

The alignment of large language models (LLMs) with human values increasingly relies on using other LLMs as automated judges, or "autoraters". However, their reliability is limited by a foundational issue: they are trained on discrete preference labels, which forces a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: 1) direct supervised fine-tuning for dense, probabilistic labels, and 2) a reinforcement learning approach for sparse, binary labels. Our empirical results show that fine-tuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks.
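As a rough illustration of what a distribution-matching objective and a positional-bias check could look like for pairwise preferences, the following minimal sketch may help. It is not the paper's method: the abstract does not specify the loss or the bias metric, so the cross-entropy form, the absolute-difference bias measure, and the function names `soft_label_cross_entropy` and `positional_inconsistency` are all assumptions made for illustration.

```python
import math


def soft_label_cross_entropy(p_pred, p_target, eps=1e-12):
    """Cross-entropy between a target preference probability and a prediction.

    Both arguments are the probability that response A is preferred over B.
    Assumed objective (hypothetical): for a fixed p_target, the loss is
    smallest when p_pred equals p_target.
    """
    # Clamp the prediction away from 0 and 1 so the logarithms stay finite.
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(p_target * math.log(p) + (1.0 - p_target) * math.log(1.0 - p))


def positional_inconsistency(p_a_shown_first, p_a_shown_second):
    """Gap between P(A preferred) under the two presentation orders.

    Assumed metric (hypothetical): 0 means the prediction for this pair
    does not depend on the order in which the responses are shown.
    """
    return abs(p_a_shown_first - p_a_shown_second)


if __name__ == "__main__":
    # Suppose 70% of the target population prefers A over B.
    target = 0.7
    for pred in (0.5, 0.7, 0.9):
        print(pred, soft_label_cross_entropy(pred, target))
    print(positional_inconsistency(0.8, 0.6))
```

Under these assumptions, a prediction that matches the population-level preference rate scores better than an overconfident or uninformative one, which is the intuition behind training on dense, probabilistic labels rather than a single hard label.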
