VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?

Abstract

The ability to distinguish subtle differences between visually similar images is essential in diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types (Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action) and curates paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.
