Language Models Prefer What They Know: Relative Confidence Estimation via Confidence Preferences
February 3, 2025
Authors: Vaishnavi Shrivastava, Ananya Kumar, Percy Liang
cs.AI
Abstract
Language models (LMs) should provide reliable confidence estimates to help
users detect mistakes in their outputs and defer to human experts when
necessary. Asking a language model to assess its confidence ("Score your
confidence from 0-1.") is a natural way of evaluating its uncertainty. However,
models struggle to provide absolute assessments of confidence (i.e. judging
confidence in answering a question independent of other questions) and the
coarse-grained scores they produce are not useful for evaluating the
correctness of their answers. We propose relative confidence estimation, where
we match up questions against each other and ask the model to make relative
judgments of confidence ("Which question are you more confident in answering
correctly?"). Treating each question as a "player" in a series of matchups
against other questions and the model's preferences as match outcomes, we can
use rank aggregation methods like Elo rating and Bradley-Terry to translate the
model's confidence preferences into confidence scores. We evaluate relative
confidence estimation against absolute confidence estimation and
self-consistency confidence methods on five state-of-the-art LMs -- GPT-4,
GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.1 405B -- across 14
challenging STEM, social science, and commonsense reasoning question answering
tasks. Our results demonstrate that relative confidence estimation consistently
provides more reliable confidence scores than absolute confidence estimation,
with average gains of 3.5% in selective classification AUC over direct absolute
confidence estimation methods and 1.7% over self-consistency approaches across
all models and datasets.
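To make the matchup-and-aggregation idea concrete, below is a minimal sketch (not the authors' released code) of the pipeline the abstract describes: aggregate pairwise confidence preferences into per-question scores with Bradley-Terry, then evaluate those scores with selective classification AUC (average accuracy over coverage levels when answering only the most confident questions). The win matrix here is filled with synthetic preferences; in practice each entry would come from prompting the LM with the relative question ("Which question are you more confident in answering correctly?"), and Elo rating would be a drop-in alternative aggregator. All function names are illustrative.

```python
import numpy as np

def bradley_terry_scores(wins: np.ndarray, n_iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i, j] = number of matchups in which the model preferred question i
    over question j. Uses the standard minorization-maximization updates.
    """
    n = wins.shape[0]
    p = np.ones(n)  # uniform initial strengths
    for _ in range(n_iters):
        p_new = np.empty(n)
        for i in range(n):
            total_wins = wins[i].sum()
            # Sum over opponents of (matches played) / (combined strength).
            denom = sum(
                (wins[i, j] + wins[j, i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            p_new[i] = total_wins / denom if denom > 0 else p[i]
        p = p_new / p_new.sum()       # normalize to keep the scale fixed
        p = np.maximum(p, 1e-12)      # avoid zero strengths in later updates
    return p  # higher strength = relatively higher confidence

def selective_classification_auc(confidence: np.ndarray,
                                 correct: np.ndarray) -> float:
    """Area under the accuracy-coverage curve: for each coverage k, answer
    only the top-k most confident questions, then average the accuracies."""
    order = np.argsort(-confidence)
    hits = correct[order].astype(float)
    coverage_acc = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(coverage_acc.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_questions = 20
    # Synthetic stand-in for LM preferences: each question has a latent
    # "ease", and a matchup is won stochastically by the easier question.
    ease = rng.normal(size=n_questions)
    wins = np.zeros((n_questions, n_questions))
    for i in range(n_questions):
        for j in range(i + 1, n_questions):
            p_i_beats_j = 1.0 / (1.0 + np.exp(ease[j] - ease[i]))
            if rng.random() < p_i_beats_j:
                wins[i, j] += 1
            else:
                wins[j, i] += 1
    scores = bradley_terry_scores(wins)
    # Synthetic correctness correlated with ease, for illustration only.
    correct = (ease + 0.5 * rng.normal(size=n_questions)) > 0
    print("Selective classification AUC:",
          round(selective_classification_auc(scores, correct), 3))
```

Because the fitted strengths are normalized, only their relative ordering matters for selective classification, which is exactly what the pairwise preferences are meant to recover.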