

SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models

August 25, 2025
Authors: Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson
cs.AI

Abstract

Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains that have existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the problems containing semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures from tokenization in domain notation and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.
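The abstract reports that cross-modal agreement between paired inputs is relatively low. As a minimal sketch of what such a metric can look like (the function name and example answers below are hypothetical, not taken from the paper), one can compare a model's answers to the text and image versions of each semantically equivalent item:

```python
# Hypothetical sketch: cross-modal agreement rate between the answers a model
# gives to semantically equivalent text and image versions of each item.
def agreement_rate(text_answers, vision_answers):
    """Fraction of paired items where the two modalities yield the same answer."""
    assert len(text_answers) == len(vision_answers)
    matches = sum(t == v for t, v in zip(text_answers, vision_answers))
    return matches / len(text_answers)

# Example with made-up answers: 3 of 4 paired items agree.
rate = agreement_rate(["e4", "Nf3", "O-O", "d4"],
                      ["e4", "Nf3", "O-O", "d5"])
print(rate)  # → 0.75
```

Note that agreement can be low even when both modalities score similarly overall, since the two conditions may err on different items.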