

Judging with Confidence: Calibrating Autoraters to Preference Distributions

September 30, 2025
作者: Zhuohang Li, Xiaowei Li, Chengyu Huang, Guowang Li, Katayoon Goshvadi, Bo Dai, Dale Schuurmans, Paul Zhou, Hamid Palangi, Yiwen Song, Palash Goyal, Murat Kantarcioglu, Bradley A. Malin, Yuan Xue
cs.AI

Abstract

The alignment of large language models (LLMs) with human values increasingly relies on using other LLMs as automated judges, or "autoraters". However, their reliability is limited by a foundational issue: they are trained on discrete preference labels, which force a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: 1) a direct supervised fine-tuning approach for dense, probabilistic labels, and 2) a reinforcement learning approach for sparse, binary labels. Our empirical results show that fine-tuning autoraters with a distribution-matching objective yields verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks.
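To make the two data conditions concrete, below is a minimal sketch (not the paper's implementation) of how the training signals described in the abstract could look, assuming the autorater verbalizes a probability q = P(response A is preferred over response B) for each comparison. The function names and the use of a soft binary cross-entropy / log-likelihood reward are illustrative assumptions.

```python
# Minimal sketch of the two training signals described in the abstract.
# Assumption: the autorater outputs a verbalized probability q in (0, 1)
# that response A is preferred over response B by the target population.
import torch
import torch.nn.functional as F

def dense_distribution_matching_loss(q_pred: torch.Tensor, p_target: torch.Tensor) -> torch.Tensor:
    """Supervised objective for dense, probabilistic labels: match the target
    preference probability with a soft binary cross-entropy (equivalent, up to
    a constant, to the KL divergence between the two Bernoulli distributions)."""
    q_pred = q_pred.clamp(1e-6, 1 - 1e-6)
    return F.binary_cross_entropy(q_pred, p_target)

def sparse_binary_reward(q_pred: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Signal for sparse, binary labels (a single sampled preference vote):
    the log-likelihood the autorater assigns to the observed label, which
    could serve as a per-example reward in an RL fine-tuning loop."""
    q_pred = q_pred.clamp(1e-6, 1 - 1e-6)
    return label * q_pred.log() + (1 - label) * (1 - q_pred).log()

# Example: if 60% of annotators prefer A, a calibrated autorater should verbalize ~0.6.
q = torch.tensor([0.55])
print(dense_distribution_matching_loss(q, torch.tensor([0.60])))  # small loss near calibration
print(sparse_binary_reward(q, torch.tensor([1.0])))               # higher reward as q -> 1
```

Under this reading, the dense objective pulls the verbalized probability toward the population preference rate, while the sparse reward only observes one vote per comparison but has the same optimum in expectation, which is consistent with the abstract's claim that both routes target the same preference distribution.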