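Which Pre-training Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language Models and Video Generation Models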

Abstract

Spatial intelligence requires visual representations that capture both the semantic objects and the geometric structure of the physical world. Two major pre-training paradigms are now widely used as foundation backbones for this purpose: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds. However, it remains unclear which paradigm provides the better representation substrate for spatial intelligence. In this paper, we present the first systematic frozen-feature probing study of VLMs and VGMs across three representative axes of spatial intelligence: semantic tagging, instance grouping, and 3D geometry prediction. By using lightweight probes, our framework enables a controlled comparison of the information already encoded in the frozen representations of the two model families. Experimental results reveal a clear complementarity: VLMs are stronger at semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Moreover, a naive fusion of the two already yields a representation that excels at both geometry and semantics, suggesting a promising direction for building stronger spatial-intelligence backbones by effectively integrating features from both model families. Our code is available at https://github.com/om-ai-lab/Probing-VLM-VGM.
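The frozen-feature probing protocol described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the probe architecture or the fusion operator, so the choices here are assumptions made only for illustration: a closed-form ridge-regression linear probe and fusion by channel concatenation. The feature arrays are synthetic stand-ins rather than real VLM or VGM outputs, and the complementarity between them is built in by construction, so the numbers demonstrate the protocol and are not evidence for the paper's findings. All function and variable names are hypothetical.

```python
import numpy as np

def fit_ridge_probe(feats, targets, lam=1e-3):
    """Fit a linear probe W minimizing ||feats @ W - targets||^2 + lam * ||W||^2.

    The backbone features are treated as fixed inputs; only W is learned.
    """
    dim = feats.shape[1]
    gram = feats.T @ feats + lam * np.eye(dim)
    return np.linalg.solve(gram, feats.T @ targets)

def probe_mse(feats, targets, weights):
    """Mean squared error of a fitted probe on held-out tokens."""
    return float(np.mean((feats @ weights - targets) ** 2))

def fuse(feats_a, feats_b):
    """Assumed 'naive fusion': concatenate per-token features along channels."""
    return np.concatenate([feats_a, feats_b], axis=1)

rng = np.random.default_rng(0)
n_tokens = 512

# Synthetic stand-ins for frozen per-token features (NOT real model outputs).
vlm_feats = rng.normal(size=(n_tokens, 16))
vgm_feats = rng.normal(size=(n_tokens, 24))

# Synthetic targets: the "semantic" columns depend only on vlm_feats and the
# "geometry" column only on vgm_feats, so complementarity holds by design.
semantic = vlm_feats @ rng.normal(size=(16, 3))
geometry = vgm_feats @ rng.normal(size=(24, 1))
targets = np.concatenate([semantic, geometry], axis=1)

train, held_out = slice(0, 384), slice(384, None)
candidates = {
    "vlm": vlm_feats,
    "vgm": vgm_feats,
    "fused": fuse(vlm_feats, vgm_feats),
}

# Same probe, same data split, same targets for every feature source.
results = {}
for name, feats in candidates.items():
    weights = fit_ridge_probe(feats[train], targets[train])
    results[name] = probe_mse(feats[held_out], targets[held_out], weights)

for name, mse in results.items():
    print(f"{name}: held-out MSE = {mse:.4g}")
```

Holding the probe, the split, and the targets fixed while varying only the feature source is what makes the comparison controlled: any difference in held-out error reflects what each frozen representation exposes to a simple readout, not differences in the capacity of the probe.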