Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Abstract

Vision-language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under alternative spatial configurations while remaining consistent with the observed input. To study this capability, we formulate three tasks, Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), and construct datasets of approximately 20K examples with ground-truth imaginations, answers, and evaluation benchmarks. Using the unified VLM BAGEL as the backbone, IPT supervision consistently improves spatial reasoning and often outperforms textual chain-of-thought training, even when no images are generated at inference time. IPT improves accuracy by 3.4% on MVC and achieves performance competitive with strong closed-source models on PT. We further find that combining IPT with label-only supervision yields additional gains, whereas textual chain-of-thought supervision can substantially degrade performance, suggesting a modality mismatch when spatial computation is forced through language. Overall, IPT provides a principled supervision signal for reasoning about unobserved spatial structure, improving generalization while producing interpretable intermediate representations.