SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models

Abstract

Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains that already have standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe a systematic modality imbalance: vision frequently lags language in overall performance even though the problems contain semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures caused by tokenization of domain notation, and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.
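To make the two quantities mentioned above concrete, the following is a minimal sketch of how per-modality accuracy and cross-modal agreement could be computed from paired answers. It is an illustration under stated assumptions, not the benchmark's actual evaluation code: the `PairedResult` record, the `modality_report` function, and the definition of agreement as the fraction of items on which the model gives the same answer in both modalities are all hypothetical, and the toy answers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """One benchmark item answered in both modalities (hypothetical record)."""
    gold: str          # reference answer
    text_answer: str   # model answer given the textual notation
    image_answer: str  # model answer given the visual notation


def modality_report(results: list[PairedResult]) -> dict[str, float]:
    """Return per-modality accuracy and a simple cross-modal agreement rate."""
    if not results:
        raise ValueError("results must be non-empty")
    n = len(results)
    text_acc = sum(r.text_answer == r.gold for r in results) / n
    image_acc = sum(r.image_answer == r.gold for r in results) / n
    # Assumed definition: agreement counts identical answers across
    # modalities, whether or not those answers are correct.
    agreement = sum(r.text_answer == r.image_answer for r in results) / n
    return {
        "text_accuracy": text_acc,
        "image_accuracy": image_acc,
        "agreement": agreement,
    }


# Toy, made-up answers for illustration only; not results from the paper.
toy = [
    PairedResult(gold="A", text_answer="A", image_answer="A"),
    PairedResult(gold="B", text_answer="B", image_answer="C"),
    PairedResult(gold="C", text_answer="C", image_answer="C"),
    PairedResult(gold="D", text_answer="A", image_answer="A"),
]
report = modality_report(toy)
print(report)
```

Counting agreement independently of correctness is a deliberate choice in this sketch: a model can agree with itself across modalities while being wrong in both, so agreement and accuracy capture different aspects of modality-agnostic behavior.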
