Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models

Abstract

Uncertainty quantification is essential for assessing the reliability and trustworthiness of modern AI systems. Among existing approaches, verbalized uncertainty, where models express their confidence through natural language, has emerged as a lightweight and interpretable solution in large language models (LLMs). However, its effectiveness in vision-language models (VLMs) remains insufficiently studied. In this work, we conduct a comprehensive evaluation of verbalized confidence in VLMs, spanning three model categories, four task domains, and three evaluation scenarios. Our results show that current VLMs often display notable miscalibration across diverse tasks and settings. Notably, visual reasoning models (i.e., thinking with images) consistently exhibit better calibration, suggesting that modality-specific reasoning is critical for reliable uncertainty estimation. To further address calibration challenges, we introduce Visual Confidence-Aware Prompting, a two-stage prompting strategy that improves confidence alignment in multimodal settings. Overall, our study highlights the inherent miscalibration in VLMs across modalities. More broadly, our findings underscore the fundamental importance of modality alignment and model faithfulness in advancing reliable multimodal systems.
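To make "miscalibration" concrete, the sketch below computes the expected calibration error (ECE), a widely used summary of the gap between stated confidence and observed accuracy. The abstract does not say which metric the study uses, so the choice of ECE, the function name `expected_calibration_error`, the equal-width binning, and the default of 10 bins are all assumptions made for illustration. Verbalized confidences are assumed to have already been parsed into numbers in [0, 1].

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin.

    Illustrative helper (hypothetical name); not taken from the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one prediction")

    # Each bin accumulates [count, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence {conf} is outside [0, 1]")
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(ok)

    total = len(confidences)
    ece = 0.0
    for count, conf_sum, n_correct in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        gap = abs(n_correct / count - conf_sum / count)
        ece += (count / total) * gap
    return ece


# A model that says "90% sure" on four answers but gets only three right.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

In this toy case the stated confidence (0.90) exceeds the accuracy (0.75), so the ECE is about 0.15; a well-calibrated model would score close to 0.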
