
Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

April 23, 2026
作者: Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand, Mitesh M. Khapra
cs.AI

Abstract

Large Vision-Language Models (VLMs) are increasingly used to evaluate the outputs of other models, both for image-to-text (I2T) tasks such as visual question answering and for text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for such quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4,000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs under three paradigms: single-answer scoring, pairwise comparison, and reference-guided evaluation. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs, with miss rates exceeding 50% in some cases; they struggle particularly with fine-grained compositional and spatial errors; and they are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.
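The central measurement in the abstract is a miss rate: the fraction of perturbed outputs an evaluator fails to score lower than the unperturbed original. As an illustration only (the paper's actual protocol and scoring scale are not specified here), a minimal sketch of that computation under the single-answer-scoring paradigm, with a hypothetical `Instance` record holding the evaluator's two scores:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    # Hypothetical record: the evaluator's score for the original
    # output and for its perturbed counterpart (higher = better).
    original_score: float
    perturbed_score: float

def is_missed(inst: Instance, margin: float = 0.0) -> bool:
    """A perturbation counts as detected only if the evaluator scores
    the perturbed output lower than the original by more than `margin`;
    otherwise it is a miss (a blind spot)."""
    return inst.perturbed_score >= inst.original_score - margin

def miss_rate(instances: list[Instance], margin: float = 0.0) -> float:
    """Fraction of perturbed instances the evaluator failed to detect."""
    return sum(is_missed(i, margin) for i in instances) / len(instances)

# Toy data: two of four perturbations go undetected.
data = [
    Instance(original_score=8.0, perturbed_score=5.0),  # detected: score dropped
    Instance(original_score=7.0, perturbed_score=7.0),  # missed: no drop
    Instance(original_score=9.0, perturbed_score=9.5),  # missed: perturbed scored higher
    Instance(original_score=6.0, perturbed_score=3.0),  # detected
]
print(miss_rate(data))  # 0.5
```

A pairwise-comparison variant would instead ask the evaluator to pick the better of the two outputs directly and count a miss whenever it prefers (or ties with) the perturbed one; the abstract reports that this framing is more reliable, though misses persist.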