Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models

Abstract

Uncertainty quantification is essential for assessing the reliability and trustworthiness of modern AI systems. Among existing approaches, verbalized uncertainty, where models express their confidence through natural language, has emerged as a lightweight and interpretable solution in large language models (LLMs). However, its effectiveness in vision-language models (VLMs) remains insufficiently studied. In this work, we conduct a comprehensive evaluation of verbalized confidence in VLMs, spanning three model categories, four task domains, and three evaluation scenarios. Our results show that current VLMs often display notable miscalibration across diverse tasks and settings. Notably, visual reasoning models (i.e., thinking with images) consistently exhibit better calibration, suggesting that modality-specific reasoning is critical for reliable uncertainty estimation. To further address calibration challenges, we introduce Visual Confidence-Aware Prompting, a two-stage prompting strategy that improves confidence alignment in multimodal settings. Overall, our study highlights the inherent miscalibration in VLMs across modalities. More broadly, our findings underscore the fundamental importance of modality alignment and model faithfulness in advancing reliable multimodal systems.
