

Judging with Confidence: Calibrating Autoraters to Preference Distributions

September 30, 2025
作者: Zhuohang Li, Xiaowei Li, Chengyu Huang, Guowang Li, Katayoon Goshvadi, Bo Dai, Dale Schuurmans, Paul Zhou, Hamid Palangi, Yiwen Song, Palash Goyal, Murat Kantarcioglu, Bradley A. Malin, Yuan Xue
cs.AI

Abstract

The alignment of large language models (LLMs) with human values increasingly relies on using other LLMs as automated judges, or "autoraters". However, their reliability is limited by a foundational issue: they are trained on discrete preference labels, forcing a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: 1) a direct supervised fine-tuning approach for dense, probabilistic labels, and 2) a reinforcement learning approach for sparse, binary labels. Our empirical results show that fine-tuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks.
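The abstract does not spell out the distribution-matching objective, so the following is only a minimal sketch of one natural reading: for dense, probabilistic labels, treat the population preference rate q as a soft target and fine-tune the judge's predicted probability p with a cross-entropy loss against q (equivalent to minimizing KL(q || p) up to a constant). The model here is a stand-in tensor of logits, not the paper's autorater.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy setup: the autorater emits a logit for "response A is
# preferred over response B"; the target is the population preference
# rate q in [0, 1] (a dense, probabilistic label).

def distribution_matching_loss(logits: torch.Tensor,
                               target_probs: torch.Tensor) -> torch.Tensor:
    """BCE against soft labels: -[q*log p + (1-q)*log(1-p)]."""
    return F.binary_cross_entropy_with_logits(logits, target_probs)

# Three pairwise comparisons with population preference rates 0.9,
# 0.5 (a genuinely ambiguous case), and 0.2.
logits = torch.zeros(3, requires_grad=True)   # stand-in for model outputs
targets = torch.tensor([0.9, 0.5, 0.2])

opt = torch.optim.SGD([logits], lr=1.0)
for _ in range(200):
    opt.zero_grad()
    loss = distribution_matching_loss(logits, targets)
    loss.backward()
    opt.step()

print(torch.sigmoid(logits))  # ≈ [0.9, 0.5, 0.2]: matches the target distribution
```

Note that the loss is minimized at p = q rather than at a hard 0/1 label, which is exactly the property that lets the judge express ambiguity instead of being forced onto a single ground truth.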
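For the sparse, binary-label setting, the abstract gives no algorithmic detail; the numpy sketch below is not the paper's method, only an illustration of the statistical point that makes such calibration possible. When the reward for a verbalized probability is a proper scoring rule (here, the log-likelihood of each observed binary label), the expected reward is maximized exactly when the predicted probability equals the population preference rate, so even one-bit labels can, in aggregate, calibrate a probabilistic judge.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: for a single comparison, each sampled annotator
# prefers response A with unknown population rate q. The judge holds a
# single logit theta; the reward for predicting p = sigmoid(theta) on a
# binary label y is log p if y else log(1 - p), a proper scoring rule.

q = 0.7        # true (unobserved) population preference rate
theta = 0.0    # judge's logit for "A preferred"
lr = 0.05

for step in range(20000):
    p = 1.0 / (1.0 + np.exp(-theta))
    y = rng.random() < q              # one sparse, binary preference label
    # Gradient ascent on the log-likelihood reward: d/dtheta = y - p.
    theta += lr * ((1.0 if y else 0.0) - p)

print(1.0 / (1.0 + np.exp(-theta)))   # ≈ 0.7: calibrated to the population rate
```

Because the objective's optimum sits at p = q, no single annotation ever needs to carry the full distribution; the sparse labels only need to be sampled from it.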