

VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?

March 9, 2026
作者: Minkyu Kim, Sangheon Lee, Dongmin Park
cs.AI

Abstract

The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types - Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action - and we curate paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.