Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

Abstract

We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking, the ability to perceive an environment or situation from an alternative viewpoint, is a key benchmark for human-level visual understanding and is essential for interacting with the environment and collaborating with autonomous agents. Despite advances in spatial reasoning within VLMs, recent research has shown that modern VLMs significantly lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge this gap between VLMs and human perception, we focus on the role of mental imagery: humans perceive the world through abstracted representations that make perspective shifts easier. Motivated by this, we propose a framework for perspective-aware reasoning, named Abstract Perspective Change (APC), that leverages vision foundation models for object detection, segmentation, and orientation estimation to construct scene abstractions and enable perspective transformations. On synthetic and real-image benchmarks, our framework significantly improves perspective-aware reasoning compared with various VLMs, and it further outperforms fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.
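To make the idea of a perspective transformation over a scene abstraction concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified top-down 2D abstraction in which each object has a position in the camera frame (+x to the right, +z forward) and a facing direction. The names `SceneObject`, `to_viewer_frame`, and `side_of` are hypothetical and introduced only for illustration. In the framework described above, positions and orientations would come from vision foundation models, whereas here they are entered by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical abstract scene entry, expressed in the camera frame."""
    name: str
    x: float               # lateral position, +x is the camera's right
    z: float               # depth, +z is the camera's forward direction
    facing_x: float = 0.0  # facing direction (x component)
    facing_z: float = 1.0  # facing direction (z component)


def to_viewer_frame(obj: SceneObject, viewer: SceneObject) -> tuple:
    """Return (right, forward) coordinates of obj as seen from viewer."""
    norm = math.hypot(viewer.facing_x, viewer.facing_z)
    if norm == 0.0:
        raise ValueError("viewer facing direction must be non-zero")
    # Unit forward vector of the viewer.
    fx, fz = viewer.facing_x / norm, viewer.facing_z / norm
    # Unit right vector: forward rotated so that forward (0, 1) gives right (1, 0).
    rx, rz = fz, -fx
    # Offset from viewer to object, projected onto the viewer's axes.
    dx, dz = obj.x - viewer.x, obj.z - viewer.z
    return (dx * rx + dz * rz, dx * fx + dz * fz)


def side_of(obj: SceneObject, viewer: SceneObject) -> str:
    """Answer 'left' or 'right' from the viewer's perspective."""
    right, _ = to_viewer_frame(obj, viewer)
    return "left" if right < 0 else "right"


# The camera sits at the origin looking along +z.
camera = SceneObject("camera", 0.0, 0.0)
# A person stands 4 units away and faces the camera.
person = SceneObject("person", 0.0, 4.0, facing_x=0.0, facing_z=-1.0)
# A ball lies slightly to the camera's right, between the camera and the person.
ball = SceneObject("ball", 1.0, 3.0)

print("from the camera:", side_of(ball, camera))
print("from the person:", side_of(ball, person))
```

In this toy scene the egocentric (camera) answer and the person's answer differ, which is the kind of discrepancy that perspective-aware reasoning has to resolve. A full system would work with 3D poses rather than this 2D simplification.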