CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?

October 28, 2025
Authors: Qing Zong, Jiayu Liu, Tianshi Zheng, Chunyang Li, Baixuan Xu, Haochen Shi, Weiqi Wang, Zhaowei Wang, Chunkit Chan, Yangqiu Song
cs.AI

Abstract

Accurate confidence calibration in Large Language Models (LLMs) is critical for safe use in high-stakes domains, where clear verbalized confidence enhances user trust. Traditional methods that mimic reference confidence expressions often fail to capture the reasoning needed for accurate confidence assessment. We propose natural language critiques as a solution, ideally suited for confidence calibration, since precise gold confidence labels are hard to obtain and often require multiple generations. This paper studies how natural language critiques can enhance verbalized confidence, addressing two questions: (1) What to critique: uncertainty (question-focused) or confidence (answer-specific)? Our analysis shows that confidence suits multiple-choice tasks, while uncertainty excels in open-ended scenarios. (2) How to critique: self-critique or critique calibration training? We propose Self-Critique, which enables LLMs to critique and optimize their confidence beyond mere accuracy, and CritiCal, a novel Critique Calibration training method that leverages natural language critiques to improve confidence calibration, moving beyond direct numerical optimization. Experiments show that CritiCal significantly outperforms Self-Critique and other competitive baselines, even surpassing its teacher model, GPT-4o, in complex reasoning tasks. CritiCal also shows robust generalization in out-of-distribution settings, advancing LLMs' reliability.