Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

Abstract

We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking, the ability to perceive an environment or situation from an alternative viewpoint, is a key benchmark for human-level visual understanding and is essential for interacting with environments and collaborating with autonomous agents. Despite advances in spatial reasoning in VLMs, recent research has shown that modern VLMs largely lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge the gap between VLMs and human perception, we focus on the role of mental imagery, in which humans perceive the world through abstracted representations that facilitate perspective shifts. Motivated by this, we propose a framework for perspective-aware reasoning, named Abstract Perspective Change (APC), that leverages vision foundation models for object detection, segmentation, and orientation estimation to construct scene abstractions and enable perspective transformations. Experiments on synthetic and real-image benchmarks show that our framework significantly improves perspective-aware reasoning over various VLMs and also outperforms fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.
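To make the idea of a perspective transformation over a scene abstraction concrete, the following is a minimal sketch and not the APC implementation. It assumes a simplified 2D ground-plane abstraction in which each object has a position in the camera frame (+x to the right, +z away from the camera) and a facing angle. The names `SceneObject`, `to_viewer_frame`, and `side_of` are hypothetical and introduced only for illustration; in practice the positions and orientations would come from detection, segmentation, and orientation-estimation models.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical abstract scene entry, expressed in the camera frame."""
    name: str
    x: float                  # +x points to the camera's right
    z: float                  # +z points away from the camera (depth)
    heading_deg: float = 0.0  # 0 = facing +z; positive angles turn toward +x


def to_viewer_frame(target: SceneObject, viewer: SceneObject) -> tuple[float, float]:
    """Return (right, forward) coordinates of target in the viewer's own frame."""
    dx = target.x - viewer.x
    dz = target.z - viewer.z
    theta = math.radians(viewer.heading_deg)
    # Viewer's forward axis is (sin t, cos t); its right axis is (cos t, -sin t).
    right = dx * math.cos(theta) - dz * math.sin(theta)
    forward = dx * math.sin(theta) + dz * math.cos(theta)
    return right, forward


def side_of(target: SceneObject, viewer: SceneObject) -> str:
    """Answer a left/right question from the viewer's perspective."""
    right, _ = to_viewer_frame(target, viewer)
    return "right" if right > 0 else "left"


# A person facing the camera, with a dog that appears to the right in the image.
camera = SceneObject("camera", x=0.0, z=0.0, heading_deg=0.0)
person = SceneObject("person", x=0.0, z=3.0, heading_deg=180.0)
dog = SceneObject("dog", x=1.0, z=2.0)
```

With these assumed values, `side_of(dog, camera)` gives the egocentric answer and `side_of(dog, person)` gives the answer from the person's viewpoint. The two differ, which is the kind of discrepancy that perspective-aware reasoning has to resolve.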
