VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?

Abstract

The ability to distinguish subtle differences between visually similar images is essential in domains such as industrial anomaly detection, medical imaging, and aerial surveillance. Comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, but they focus primarily on images with large, salient differences and fail to capture the nuanced reasoning that real-world applications require. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. The benchmark covers ten difference types (Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action), and we curate paired question-image sets that reflect these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and we provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.
