Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
April 24, 2025
Authors: Phillip Y. Lee, Jihyeon Je, Chanho Park, Mikaela Angelina Uy, Leonidas Guibas, Minhyuk Sung
cs.AI
Abstract
We present a framework for perspective-aware reasoning in vision-language
models (VLMs) through mental imagery simulation. Perspective-taking, the
ability to perceive an environment or situation from an alternative viewpoint,
is a key benchmark for human-level visual understanding, essential for
environmental interaction and collaboration with autonomous agents. Despite
advancements in spatial reasoning within VLMs, recent research has shown that
modern VLMs significantly lack perspective-aware reasoning capabilities and
exhibit a strong bias toward egocentric interpretations. To bridge the gap
between VLMs and human perception, we focus on the role of mental imagery,
where humans perceive the world through abstracted representations that
facilitate perspective shifts. Motivated by this, we propose a framework for
perspective-aware reasoning, named Abstract Perspective Change (APC), that
effectively leverages vision foundation models, such as object detection,
segmentation, and orientation estimation, to construct scene abstractions and
enable perspective transformations. Our experiments on synthetic and real-image
benchmarks, compared with various VLMs, demonstrate significant improvements in
perspective-aware reasoning with our framework, further outperforming
fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.
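The core idea of abstracting a scene and re-expressing it from another viewpoint can be sketched as follows. This is a minimal illustration under assumed conventions, not the paper's actual APC pipeline: the scene data, `to_viewer_frame` helper, and the 2D setup are all hypothetical.

```python
import math

def to_viewer_frame(p, viewer_pos, viewer_heading):
    """Express world point p in a viewer's egocentric 2D frame.

    viewer_heading: angle (radians) of the viewer's forward axis in world coords.
    Returns (forward, left) coordinates relative to the viewer.
    """
    dx = p[0] - viewer_pos[0]
    dy = p[1] - viewer_pos[1]
    c, s = math.cos(viewer_heading), math.sin(viewer_heading)
    forward = dx * c + dy * s   # projection onto the viewer's forward axis
    left = -dx * s + dy * c     # projection onto the viewer's left axis
    return forward, left

# Abstract scene: named objects with world-frame positions (hypothetical data,
# e.g. as produced by object detection and orientation estimation).
scene = {"ball": (2.0, 3.0), "chair": (4.0, 1.0)}

# A surrogate viewer standing at (3, 0) and facing the +y direction.
viewer_pos, heading = (3.0, 0.0), math.pi / 2

for name, pos in scene.items():
    fwd, left = to_viewer_frame(pos, viewer_pos, heading)
    side = "left" if left > 0 else "right"
    print(f"{name}: {side} of the viewer")  # ball: left, chair: right
```

Once every object's position is rewritten in the alternative viewer's frame, egocentric spatial questions ("is the ball on my left?") can be answered for that viewpoint by elementary sign checks, rather than by reasoning directly over the original egocentric image.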