HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation
June 26, 2025
Authors: Xinzhuo Li, Adheesh Juvekar, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Ismini Lourentzou
cs.AI
Abstract
Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.
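The abstract does not spell out the metric definitions, but the setup suggests scoring a model on a factual image and on a counterfactually edited image where the queried object has been removed. Below is a minimal sketch of one plausible sensitivity measure under that assumption; the function names `iou` and `counterfactual_sensitivity`, their signatures, and the persistence formula are illustrative and not the paper's actual metrics.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 0.0

def counterfactual_sensitivity(pred_factual: np.ndarray,
                               gt_factual: np.ndarray,
                               pred_counterfactual: np.ndarray):
    """
    Hypothetical sensitivity score (not the paper's definition):
    a faithful model segments the object well on the factual image
    and predicts a (near-)empty mask once the object is edited out.
    """
    factual_iou = iou(pred_factual, gt_factual)
    # Fraction of the original object region still (wrongly) segmented
    # in the counterfactual image, i.e. hallucinated persistence.
    hallucinated = np.logical_and(pred_counterfactual, gt_factual).sum()
    persistence = float(hallucinated) / max(int(gt_factual.sum()), 1)
    return factual_iou, persistence
```

Under this sketch, a model that "persists in false segmentation" (as the abstract describes) would show a high persistence score on the edited image despite a reasonable IoU on the factual one, which is exactly the failure mode a counterfactual pair is designed to expose.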