Measuring Epistemic Humility in Multimodal Large Language Models
September 11, 2025
Authors: Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou
cs.AI
Abstract
Hallucinations in multimodal large language models (MLLMs) -- where the model
generates content inconsistent with the input image -- pose significant risks
in real-world applications, from misinformation in visual question answering to
unsafe errors in decision-making. Existing benchmarks primarily test
recognition accuracy, i.e., whether models can select the correct answer
among distractors. This overlooks an equally critical capability for
trustworthy AI: recognizing when none of the provided options are correct, a
behavior reflecting epistemic humility. We present HumbleBench, a new
hallucination benchmark designed to evaluate MLLMs' ability to reject plausible
but incorrect answers across three hallucination types: object, relation, and
attribute. Built from a panoptic scene graph dataset, we leverage fine-grained
scene graph annotations to extract ground-truth entities and relations, and
prompt GPT-4-Turbo to generate multiple-choice questions, followed by a
rigorous manual filtering process. Each question includes a "None of the above"
option, requiring models not only to recognize correct visual information but
also to identify when no provided answer is valid. We evaluate a variety of
state-of-the-art MLLMs -- including both general-purpose and specialized
reasoning models -- on HumbleBench and share valuable findings and insights
with the community. By incorporating explicit false-option rejection,
HumbleBench fills a key gap in current evaluation suites, providing a more
realistic measure of MLLM reliability in safety-critical settings. Our code and
dataset are released publicly and can be accessed at
https://github.com/maifoundations/HumbleBench.
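
To make the evaluation protocol concrete, below is a minimal sketch of how a
HumbleBench-style multiple-choice item with a "None of the above" option might
be scored. The field names (image_path, question, options, answer), the model
interface, and the separate "None of the above" accuracy are illustrative
assumptions, not the released dataset schema or the paper's official metric.

# Sketch only: hypothetical item format and scoring for a benchmark where
# each question includes a "None of the above" option.
from dataclasses import dataclass
from typing import Callable, List

NONE_OF_THE_ABOVE = "None of the above"

@dataclass
class Item:
    image_path: str       # input image
    question: str         # e.g. an object / relation / attribute question
    options: List[str]    # candidate answers, including NONE_OF_THE_ABOVE
    answer: str           # ground truth, possibly NONE_OF_THE_ABOVE

def evaluate(items: List[Item],
             model: Callable[[str, str, List[str]], str]) -> dict:
    """Report overall accuracy and accuracy on 'None of the above' items,
    where rejecting all plausible-but-wrong options (epistemic humility)
    is what is being tested."""
    correct = noa_total = noa_correct = 0
    for it in items:
        pred = model(it.image_path, it.question, it.options)
        correct += int(pred == it.answer)
        if it.answer == NONE_OF_THE_ABOVE:
            noa_total += 1
            noa_correct += int(pred == it.answer)
    return {
        "accuracy": correct / len(items),
        "noa_accuracy": noa_correct / max(noa_total, 1),
    }

Any model wrapper that maps (image, question, options) to one of the option
strings can be plugged into evaluate(); comparing overall accuracy against
noa_accuracy separates recognition ability from the willingness to reject
every provided option when none is valid.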