Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance

Abstract

This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they fail systematically at a more fundamental level: they lack the robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
