Measuring Epistemic Humility in Multimodal Large Language Models
September 11, 2025
Authors: Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou
cs.AI
Abstract
Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs' ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. HumbleBench is built from a panoptic scene graph dataset: we leverage its fine-grained scene graph annotations to extract ground-truth entities and relations, prompt GPT-4-Turbo to generate multiple-choice questions, and apply a rigorous manual filtering process. Each question includes a "None of the above" option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs -- including both general-purpose and specialized reasoning models -- on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are publicly available at https://github.com/maifoundations/HumbleBench.
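
To make the evaluation protocol concrete, below is a minimal Python sketch of what a HumbleBench-style record and scoring loop could look like. The field names, choice labels, and answer_fn interface here are illustrative assumptions rather than the released schema; see the repository above for the actual format.

# A minimal sketch of a HumbleBench-style record and scoring loop.
# The field names, choice labels, and model interface are hypothetical
# illustrations; the actual schema is defined by the released dataset.

# Each question includes a "None of the above" option, so a model must
# reject all listed answers when none matches the image.
example = {
    "image": "images/000123.jpg",
    "type": "relation",            # object, relation, or attribute
    "question": "What is the person doing with the dog?",
    "choices": {
        "A": "feeding it",
        "B": "riding it",
        "C": "brushing it",
        "D": "chasing it",
        "E": "None of the above",
    },
    "answer": "E",                 # ground truth derived from scene graph annotations
}

def evaluate(answer_fn, records):
    """Report overall accuracy plus accuracy on 'None of the above' items,
    the explicit false-option rejection that HumbleBench targets."""
    correct, nota_correct, nota_total = 0, 0, 0
    for rec in records:
        pred = answer_fn(rec["image"], rec["question"], rec["choices"])
        correct += int(pred == rec["answer"])
        if rec["answer"] == "E":
            nota_total += 1
            nota_correct += int(pred == "E")
    return {
        "accuracy": correct / len(records),
        "nota_accuracy": nota_correct / max(nota_total, 1),
    }

# Trivial baseline that always picks "None of the above".
always_e = lambda image, question, choices: "E"
print(evaluate(always_e, [example]))

Reporting accuracy on the "None of the above" subset separately, as sketched here, isolates the epistemic-humility behavior from ordinary recognition accuracy.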