Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models
May 26, 2025
Authors: Weihao Xuan, Qingcheng Zeng, Heli Qi, Junjue Wang, Naoto Yokoya
cs.AI
Abstract
Uncertainty quantification is essential for assessing the reliability and
trustworthiness of modern AI systems. Among existing approaches, verbalized
uncertainty, where models express their confidence through natural language,
has emerged as a lightweight and interpretable solution in large language
models (LLMs). However, its effectiveness in vision-language models (VLMs)
remains insufficiently studied. In this work, we conduct a comprehensive
evaluation of verbalized confidence in VLMs, spanning three model categories,
four task domains, and three evaluation scenarios. Our results show that
current VLMs often display notable miscalibration across diverse tasks and
settings. Notably, visual reasoning models (i.e., thinking with images)
consistently exhibit better calibration, suggesting that modality-specific
reasoning is critical for reliable uncertainty estimation. To further address
calibration challenges, we introduce Visual Confidence-Aware Prompting, a
two-stage prompting strategy that improves confidence alignment in multimodal
settings. Overall, our study highlights the inherent miscalibration in VLMs
across modalities. More broadly, our findings underscore the fundamental
importance of modality alignment and model faithfulness in advancing reliable
multimodal systems.
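The abstract does not spell out how verbalized confidence is elicited or how the two-stage Visual Confidence-Aware Prompting works in detail. The sketch below is a minimal, hypothetical illustration only: it shows a generic two-stage prompt flow (first elicit image-grounded evidence, then ask for an answer plus a verbalized confidence) and expected calibration error (ECE) as a standard calibration metric. The `ask_vlm` interface, prompt wording, and bin count are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a two-stage verbalized-confidence prompt flow for a VLM,
# plus expected calibration error (ECE), a standard calibration metric.
# `ask_vlm(image, prompt) -> str` is an assumed interface to any chat-style VLM.

import re
from typing import Callable, List, Tuple

STAGE1 = "Describe the visual evidence in the image that is relevant to the question: {q}"
STAGE2 = (
    "Using your description of the visual evidence:\n{evidence}\n"
    "Answer the question: {q}\n"
    "Then state your confidence as a number between 0 and 1 on a line starting with 'Confidence:'."
)

def two_stage_confidence(ask_vlm: Callable[[object, str], str],
                         image, question: str) -> Tuple[str, float]:
    """Stage 1 elicits image-grounded evidence; stage 2 asks for an answer plus verbalized confidence."""
    evidence = ask_vlm(image, STAGE1.format(q=question))
    reply = ask_vlm(image, STAGE2.format(evidence=evidence, q=question))
    match = re.search(r"Confidence:\s*([01](?:\.\d+)?)", reply)
    confidence = float(match.group(1)) if match else 0.5  # fall back to 0.5 if parsing fails
    return reply, confidence

def expected_calibration_error(confidences: List[float],
                               correct: List[bool],
                               n_bins: int = 10) -> float:
    """ECE: average |accuracy - mean confidence| per bin, weighted by bin size."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if (c > lo or b == 0) and c <= hi]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece
```

A well-calibrated model would yield an ECE near zero, i.e., when it verbalizes 0.8 confidence it is correct roughly 80% of the time; the paper's reported miscalibration corresponds to large gaps between stated confidence and actual accuracy.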