From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

Abstract

Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving it unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE (Counterfactual Attribute Factuality Evaluation), a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). We evaluate models of various types and sizes on CAFE. Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. CAFE thus provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.
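To make the paired-sample setup concrete, the following is a minimal Python sketch of how one test sample and a per-pair score might be represented. It is an illustration under stated assumptions, not the benchmark's actual data format or metric: the `PairedSample` field names, the `score_pair` helper, the 0.5 IoU threshold, and the "localized"/"fooled" labels are all hypothetical. The sketch only assumes what the abstract states, namely that each sample pairs one ground-truth mask with a positive and a misleading negative prompt, and that a model can produce an accurate mask for both.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

Mask = List[List[int]]  # binary mask as nested lists (1 = foreground)


@dataclass
class PairedSample:
    """One paired test sample; field names are hypothetical."""
    image_id: str
    gt_mask: Mask
    positive_prompt: str
    negative_prompt: str
    category: str  # "SM", "CC", or "OC"


def iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks have no union; report 0.0 rather than dividing by zero.
    return inter / union if union else 0.0


def score_pair(
    sample: PairedSample,
    pred_pos: Mask,
    pred_neg: Mask,
    thresh: float = 0.5,  # assumed threshold, not taken from the paper
) -> Dict[str, Union[float, bool]]:
    """Compare the masks predicted for the positive and the negative prompt
    against the same ground-truth mask (hypothetical scoring scheme)."""
    pos_iou = iou(pred_pos, sample.gt_mask)
    neg_iou = iou(pred_neg, sample.gt_mask)
    return {
        "pos_iou": pos_iou,
        "neg_iou": neg_iou,
        # Good localization: the positive prompt recovers the target region.
        "localized": pos_iou >= thresh,
        # Poor concept discrimination: the misleading prompt recovers it too.
        "fooled": neg_iou >= thresh,
    }


# Made-up example: a model that returns the same mask for both prompts.
gt = [[0, 1, 1],
      [0, 1, 1],
      [0, 0, 0]]
sample = PairedSample("example", gt, "positive concept", "misleading concept", "SM")
result = score_pair(sample, pred_pos=gt, pred_neg=gt)
print(result)
```

In this made-up case the model is both "localized" and "fooled": it segments the target accurately but does so regardless of the prompt, which is the kind of gap between localization quality and concept discrimination that the abstract describes.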