
SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models

August 25, 2025
Authors: Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson
cs.AI

Abstract

Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains that have existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the problems containing semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures from tokenization in domain notation and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.
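To make the evaluation setup concrete, here is a minimal sketch (not from the paper) of how per-modality accuracy and cross-modal agreement could be computed over semantically equivalent item pairs. The `PairedResult` structure and field names are hypothetical illustrations, not SEAM's actual data format.

```python
# Hypothetical sketch: compare a model's answers on the textual and visual
# renderings of the same items, then report per-modality accuracy and
# cross-modal agreement (how often the two renderings get the same answer).
from dataclasses import dataclass

@dataclass
class PairedResult:
    gold: str           # ground-truth answer for the item
    text_answer: str    # model answer for the textual-notation version
    vision_answer: str  # model answer for the visually rendered version

def evaluate(results: list[PairedResult]) -> dict[str, float]:
    n = len(results)
    text_acc = sum(r.text_answer == r.gold for r in results) / n
    vision_acc = sum(r.vision_answer == r.gold for r in results) / n
    # Agreement counts matching answers across modalities, correct or not.
    agreement = sum(r.text_answer == r.vision_answer for r in results) / n
    return {"text_acc": text_acc, "vision_acc": vision_acc, "agreement": agreement}

if __name__ == "__main__":
    demo = [
        PairedResult(gold="A", text_answer="A", vision_answer="A"),
        PairedResult(gold="B", text_answer="B", vision_answer="C"),  # vision diverges
        PairedResult(gold="D", text_answer="C", vision_answer="C"),  # consistent but wrong
    ]
    print(evaluate(demo))
```

A gap between `text_acc` and `vision_acc`, combined with low `agreement`, is the kind of modality imbalance the abstract describes.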