Imaginative Perception Tokens Improve Spatial Reasoning in Multimodal Language Models

Abstract

Vision-language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under alternative spatial configurations while remaining consistent with the observed input. To study this capability, we formulate three tasks, Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), and construct datasets of approximately 20K examples with ground-truth imaginations, answers, and evaluation benchmarks. With the unified VLM BAGEL as the backbone, IPT supervision consistently improves spatial reasoning and often outperforms textual chain-of-thought training, even when no images are generated at inference time. On MVC, IPT improves accuracy by 3.4%, and on PT it achieves performance competitive with strong closed-source models. We further find that combining IPT with label-only supervision yields additional gains, whereas textual chain-of-thought supervision can substantially degrade performance, suggesting a modality mismatch when spatial computation is forced through language. Overall, IPT provides a principled supervision signal for reasoning about unobserved spatial structure, improving generalization while producing interpretable intermediate representations.