Spatial Mental Modeling from Limited Views

Abstract

Can Vision Language Models (VLMs) imagine a full scene from just a few views, as humans do? Humans form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our new MindCube benchmark, with 21,154 questions across 3,268 images, exposes this critical gap: existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models by representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation of "what-if" movements). We then explore three approaches to help VLMs approximate spatial mental models: unseen intermediate views, natural language reasoning chains, and cognitive maps. The largest improvement comes from a synergistic approach, "map-then-reason", which jointly trains the model to first generate a cognitive map and then reason over it. Training models to reason over these internal maps boosts accuracy from 37.8% to 60.8% (+23.0%). Adding reinforcement learning pushes performance further, to 70.7% (+32.9%). Our key insight is that such scaffolding of spatial mental models, in which the model actively constructs and uses internal structured spatial representations alongside flexible reasoning processes, significantly improves understanding of unobservable space.
