

SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models

August 25, 2025
Authors: Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson
cs.AI

Abstract

Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains that have existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the problems containing semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures from tokenization in domain notation and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.
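The abstract reports that cross-modal agreement between paired inputs is relatively low. As a minimal sketch of what such a metric can look like (the function name and example answers below are hypothetical, not taken from the paper), one can compare a model's answers to the text and image versions of each semantically equivalent item:

```python
# Hypothetical sketch: cross-modal agreement rate between the answers a model
# gives to semantically equivalent text and image versions of each item.
def agreement_rate(text_answers, vision_answers):
    """Fraction of paired items where the two modalities yield the same answer."""
    assert len(text_answers) == len(vision_answers)
    matches = sum(t == v for t, v in zip(text_answers, vision_answers))
    return matches / len(text_answers)

# Example with made-up answers: 3 of 4 paired items agree.
rate = agreement_rate(["e4", "Nf3", "O-O", "d4"],
                      ["e4", "Nf3", "O-O", "d5"])
print(rate)  # → 0.75
```

Note that agreement can be low even when both modalities score similarly overall, since the two conditions may err on different items.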