

Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance

April 3, 2026
作者: Jason Qiu, Zachary Meurer, Xavier Thomas, Deepti Ghadiyaram
cs.AI

Abstract

This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: they lack the robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scalings, and even the identity transformation. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior persists across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
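The evaluation described above can be sketched as a simple protocol: apply a geometric transformation to an image, ask the model whether the transformed image shows the same object, and measure how often it answers "same". The sketch below is a minimal, hypothetical illustration of that protocol, not the authors' actual pipeline: images are toy 2D grids so the code stays self-contained, `query_vlm` is a stub standing in for a real model call, and the transform set (identity, 90° rotation, 2× scaling) mirrors the transformations named in the abstract.

```python
# Hypothetical sketch of an invariance evaluation (not the paper's code).
# Images are toy 2D grids (lists of lists) to keep the example self-contained.

def identity(img):
    """Return an unchanged copy of the grid."""
    return [row[:] for row in img]

def rotate90(img):
    """Rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def scale2x(img):
    """Nearest-neighbour 2x upscale of a 2D grid."""
    return [[v for v in row for _ in (0, 1)] for row in img for _ in (0, 1)]

TRANSFORMS = {"identity": identity, "rotate90": rotate90, "scale2x": scale2x}

def query_vlm(original, transformed):
    """Stub for a VLM call asking: 'Do these two images show the same object?'
    A real implementation would send both images and the question to a model
    and parse its yes/no answer; here we always answer 'same'."""
    return True  # placeholder

def invariance_accuracy(images):
    """Fraction of (image, transform) pairs judged 'same object' by the model.
    A geometrically robust model should score 1.0 on every transform."""
    return {
        name: sum(query_vlm(img, t(img)) for img in images) / len(images)
        for name, t in TRANSFORMS.items()
    }
```

The abstract's finding corresponds to `invariance_accuracy` dropping well below 1.0 on real models, especially for semantically sparse inputs such as sketches, even though each transform preserves object identity by construction.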