Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance

Abstract

This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they fail systematically at a more fundamental level: they lack the robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model scales, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
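To make the kind of test described above concrete, here is a minimal sketch of an invariance check over the three transformation types the abstract names (identity, rotation, scaling). It is not the authors' evaluation protocol: the helper names `make_transformed_pairs`, `invariance_accuracy`, and `pixel_equal` are hypothetical, the `same_object` callable is a stand-in for whatever query a real VLM would receive, and the 90-degree rotation and 2x nearest-neighbour upscaling are assumed purely for illustration.

```python
import numpy as np


def make_transformed_pairs(image):
    """Return (name, original, transformed) triples for three basic transforms."""
    return [
        # Identity: the second image is an unchanged copy of the first.
        ("identity", image, image.copy()),
        # Rotation: 90 degrees counterclockwise (assumed angle, for illustration).
        ("rotate_90", image, np.rot90(image, k=1)),
        # Scaling: 2x nearest-neighbour upscaling via a Kronecker product.
        ("scale_2x", image, np.kron(image, np.ones((2, 2), dtype=image.dtype))),
    ]


def invariance_accuracy(images, same_object):
    """Fraction of transformed pairs judged to show the same object.

    `same_object(a, b) -> bool` is a hypothetical stand-in for asking a VLM
    whether two images depict the same object.
    """
    verdicts = [
        bool(same_object(original, transformed))
        for image in images
        for _, original, transformed in make_transformed_pairs(image)
    ]
    return sum(verdicts) / len(verdicts)


def pixel_equal(a, b):
    """Naive baseline with no invariance: exact pixel-wise comparison."""
    return a.shape == b.shape and np.array_equal(a, b)


# A tiny synthetic "sketch": an L-shaped stroke on a 4x4 canvas.
sketch = np.zeros((4, 4), dtype=np.uint8)
sketch[0, :] = 1
sketch[:, 0] = 1

# Only the identity pair passes a naive pixel comparison.
baseline_score = invariance_accuracy([sketch], pixel_equal)
```

A perfectly invariant judge would score 1.0 on every input, so the gap between a model's score and 1.0 gives a rough measure of the fragility the abstract describes.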

