Imaginative Perception Tokens Improve Spatial Reasoning in Multimodal Language Models

Abstract

Vision-language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under alternative spatial configurations while remaining consistent with the observed input. To study this capability, we formulate three tasks, Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), and construct datasets of approximately 20K examples with ground-truth imaginations, answers, and evaluation benchmarks. Using the unified VLM BAGEL as the backbone, we show that IPT supervision consistently improves spatial reasoning and often outperforms textual chain-of-thought training, even without generating images at inference time. On MVC, IPT improves accuracy by 3.4%, and on PT it achieves performance competitive with strong closed-source models. We further find that combining IPT with label-only supervision yields additional gains, whereas textual chain of thought can substantially degrade performance, suggesting a modality mismatch when spatial computation is forced through language. Overall, IPT provides a principled supervision signal for reasoning about unobserved spatial structure, improving generalization while producing interpretable intermediate representations.