ChatPaper.ai


CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?

October 28, 2025
作者: Qing Zong, Jiayu Liu, Tianshi Zheng, Chunyang Li, Baixuan Xu, Haochen Shi, Weiqi Wang, Zhaowei Wang, Chunkit Chan, Yangqiu Song
cs.AI

Abstract

Accurate confidence calibration in Large Language Models (LLMs) is critical for safe use in high-stakes domains, where clear verbalized confidence enhances user trust. Traditional methods that mimic reference confidence expressions often fail to capture the reasoning needed for accurate confidence assessment. We propose natural language critiques as a solution, ideally suited for confidence calibration, as precise gold confidence labels are hard to obtain and often require multiple generations. This paper studies how natural language critiques can enhance verbalized confidence, addressing two questions: (1) What to critique: uncertainty (question-focused) or confidence (answer-specific)? Analysis shows that confidence critiques suit multiple-choice tasks, while uncertainty critiques excel in open-ended scenarios. (2) How to critique: self-critique or critique calibration training? We propose Self-Critique, enabling LLMs to critique and optimize their confidence beyond mere accuracy, and CritiCal, a novel Critique Calibration training method that leverages natural language critiques to improve confidence calibration, moving beyond direct numerical optimization. Experiments show that CritiCal significantly outperforms Self-Critique and other competitive baselines, even surpassing its teacher model, GPT-4o, in complex reasoning tasks. CritiCal also shows robust generalization in out-of-distribution settings, advancing LLMs' reliability.
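As general background (this is a standard metric, not the paper's CritiCal method), "confidence calibration" is commonly quantified with Expected Calibration Error (ECE): verbalized confidences are binned, and the gap between each bin's average confidence and its empirical accuracy is averaged, weighted by bin size. A minimal sketch, with arbitrary bin count and toy data:

```python
# Hedged illustration: Expected Calibration Error (ECE), a common metric for
# confidence calibration. Generic background only -- not the paper's method;
# the bin count and sample data below are arbitrary choices for this sketch.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence, then average the
    |accuracy - mean confidence| gap per bin, weighted by bin size."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Each sample falls in exactly one bin (bins are (lo, hi];
        # confidence 0.0 is assigned to the first bin).
        idx = [i for i, c in enumerate(confidences)
               if (lo < c <= hi) or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - avg_conf)
    return ece

# Perfectly calibrated toy case: 80% stated confidence, 4 of 5 correct.
confs = [0.8, 0.8, 0.8, 0.8, 0.8]
hits = [1, 1, 1, 1, 0]
print(expected_calibration_error(confs, hits))  # 0.0 for this toy case
```

A well-calibrated model drives this gap toward zero; an overconfident one (high stated confidence, low accuracy) inflates it.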
December 2, 2025