NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems

January 16, 2026
作者: Jiayu Liu, Rui Wang, Qing Zong, Qingcheng Zeng, Tianshi Zheng, Haochen Shi, Dadi Guo, Baixuan Xu, Chunyang Li, Yangqiu Song
cs.AI

Abstract

Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance due to noisy retrieved contexts. Specifically, contradictory or irrelevant evidence tends to inflate the model's false certainty, leading to severe overconfidence. To address this, we propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NAACL, a noise-aware calibration framework that synthesizes supervision from about 2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NAACL equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NAACL yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NAACL paves the way for both accurate and epistemically reliable LLMs.
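As a concrete reference for the evaluation metric, here is a minimal sketch of Expected Calibration Error (ECE): predictions are grouped into confidence bins, and ECE is the sample-weighted average gap between each bin's mean stated confidence and its empirical accuracy. The bin count and toy data below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between mean stated
    confidence and empirical accuracy within each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # First bin is closed on the left; later bins are half-open.
        if i == 0:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / n) * gap
    return ece

# Toy example of the failure mode the paper describes: high stated
# confidence paired with frequent errors (overconfidence under noise).
stated = [0.95, 0.90, 0.92, 0.88, 0.97, 0.60, 0.55]
is_correct = [1, 0, 0, 1, 0, 1, 1]
print(f"ECE = {expected_calibration_error(stated, is_correct):.3f}")
```

Under this metric, the reported gains mean that after NAACL's fine-tuning, verbalized confidence tracks answer accuracy more closely, i.e., ECE decreases.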