Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models
May 26, 2025
Authors: Weihao Xuan, Qingcheng Zeng, Heli Qi, Junjue Wang, Naoto Yokoya
cs.AI
Abstract
Uncertainty quantification is essential for assessing the reliability and
trustworthiness of modern AI systems. Among existing approaches, verbalized
uncertainty, where models express their confidence through natural language,
has emerged as a lightweight and interpretable solution in large language
models (LLMs). However, its effectiveness in vision-language models (VLMs)
remains insufficiently studied. In this work, we conduct a comprehensive
evaluation of verbalized confidence in VLMs, spanning three model categories,
four task domains, and three evaluation scenarios. Our results show that
current VLMs often display notable miscalibration across diverse tasks and
settings. Notably, visual reasoning models (i.e., thinking with images)
consistently exhibit better calibration, suggesting that modality-specific
reasoning is critical for reliable uncertainty estimation. To further address
calibration challenges, we introduce Visual Confidence-Aware Prompting, a
two-stage prompting strategy that improves confidence alignment in multimodal
settings. Overall, our study highlights the inherent miscalibration in VLMs
across modalities. More broadly, our findings underscore the fundamental
importance of modality alignment and model faithfulness in advancing reliable
multimodal systems.
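
To make the notion of (mis)calibration concrete, the sketch below scores verbalized confidence with expected calibration error (ECE), a standard calibration metric. The abstract does not specify the paper's exact metrics or implementation, so the function, bin count, and sample data here are illustrative assumptions rather than the authors' method.

```python
# A minimal, self-contained sketch (not the paper's implementation) of how
# verbalized confidence can be scored for calibration. Each prediction is a
# (stated_confidence, is_correct) pair with confidence in [0, 1]; the metric,
# bin count, and sample data below are illustrative assumptions.
from typing import List, Tuple


def expected_calibration_error(
    predictions: List[Tuple[float, bool]], num_bins: int = 10
) -> float:
    """Standard ECE: bin predictions by stated confidence, then compare each
    bin's mean confidence with its empirical accuracy, weighted by bin size."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for confidence, correct in predictions:
        # Clamp to [0, 1]; a confidence of exactly 1.0 falls into the last bin.
        clamped = max(0.0, min(1.0, confidence))
        idx = min(int(clamped * num_bins), num_bins - 1)
        bins[idx].append((clamped, correct))

    total = len(predictions)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_confidence - accuracy)
    return ece


# Hypothetical example: a VLM that verbalizes 90% confidence but answers
# correctly only 3 of 5 times is overconfident, giving ECE = |0.9 - 0.6| = 0.3.
sample = [(0.9, True), (0.9, False), (0.9, True), (0.9, False), (0.9, True)]
print(f"ECE: {expected_calibration_error(sample):.3f}")  # ECE: 0.300
```

Under this kind of scoring, a mitigation such as the paper's two-stage Visual Confidence-Aware Prompting would be judged by whether it narrows the gap between stated confidence and observed accuracy.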