VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?

Abstract

The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types (Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action) and curates paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.
